Audio, Voice and Video: Current State and Limits - Xap.es

Audio and video are the modalities where AI has advanced most dramatically in the last two years, and also where the gap between what is technically possible and what can be done without ethical friction is widest. This chapter describes the current state: what works, what does not, and which questions are worth asking before using these capabilities.

Transcription: what already works well

Automatic transcription of audio to text has reached a level of maturity that makes it practically indispensable in many professional workflows.

Whisper (OpenAI) is the reference model. It covers roughly 100 languages, and its accuracy on reasonable-quality audio is exceptional. Whisper does not identify speakers by itself, but adapted versions add speaker identification. It is available as an API, as an open-source model, and as the base of dozens of applications.

Use cases that already make sense at scale:

Meetings. Tools like Otter.ai, Fireflies or Teams/Zoom with integrated transcription automatically convert meetings into text, with speaker identification and automatic summaries. The friction of taking notes disappears.

Interviews and podcasts. Transcribing a one-hour interview that previously required 3–4 hours of manual work now takes minutes. Human time is reduced to reviewing and correcting.

Video content. Automatic subtitling, transcription for SEO, extracting quotes. Video becomes indexable and searchable content.
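As a rough illustration of the subtitling use case, the sketch below turns timed transcript segments into SRT subtitle text. The segment shape (dictionaries with `start`, `end` and `text`, times in seconds) is an assumption modelled on what transcription tools commonly return, and the function names and sample data are hypothetical.

```python
def format_timestamp(seconds):
    """Render a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Build SRT subtitle text from a list of timed transcript segments."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        # Each SRT cue: a counter, a time range, then the subtitle text.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"


# Hypothetical segments, standing in for the output of a transcription tool.
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Welcome to the show."},
    {"start": 2.5, "end": 6.04, "text": " Today we talk about audio."},
]

print(segments_to_srt(example_segments))
```

The resulting text can be saved with an `.srt` extension and loaded by most video players. Proper names and jargon in the cues would still need a human pass.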

Accuracy is not perfect: proper names, technical jargon and audio with a lot of background noise still cause errors. But even an imperfect automatic transcript is a starting point that can cut total work by roughly 70–80%.
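To make the saving concrete, here is a small back-of-the-envelope calculation. It assumes the 70–80% reduction applies to the 3–4 hours of manual work mentioned earlier for a one-hour interview; the function name is illustrative, and the result is an estimate rather than a measurement.

```python
def review_hours_remaining(manual_hours, reduction):
    """Hours of human work left once automation removes `reduction` of it."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return manual_hours * (1.0 - reduction)


# One-hour interview: 3 to 4 hours by hand, 70 to 80 % saved (figures from the text).
best_case = review_hours_remaining(3.0, 0.80)   # least manual work, largest saving
worst_case = review_hours_remaining(4.0, 0.70)  # most manual work, smallest saving

print(f"Remaining review time: {best_case * 60:.0f} to {worst_case * 60:.0f} minutes")
```

Under these assumptions, reviewing and correcting a one-hour interview takes somewhere between half an hour and a little over an hour, instead of most of a working afternoon.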

Voice synthesis

Voice synthesis — converting text to spoken audio — has made a qualitative leap. Synthesised voices from current systems are indistinguishable from a real human voice for most untrained listeners.

ElevenLabs is the quality standard. It has a catalogue of pre-configured voices and allows cloning your own voice. The naturalness of the prosody — the patterns of emphasis, pauses and intonation — is the aspect that has improved most compared to previous generations.

Uses that work well:

Narration of educational or corporate videos (instead of human recording)
Generating product demos in multiple languages
Accessible content (audiobooks, audio materials for people with visual difficulties)
Rapid prototyping of podcasts or audio products

Real limitation: voice synthesis is very good in English and most major languages, but quality drops significantly in languages that are less represented in the training data. Prosody also remains the hardest part: in natural conversation, with interruptions and emotion, synthesis still sounds artificial.

Voice cloning

Voice cloning allows replicating a specific person’s voice — their timbre, their accent, their speech patterns — from a relatively small audio sample. Some systems require only a few minutes of reference audio.

Legitimate uses: Dubbing your own content in multiple languages, restoring voice for people who have lost it through illness, personalising voice assistants.

The problem: The same technology that clones your voice to dub a video in another language can be used to produce fake audio of anyone saying things they never said. The technical threshold for doing this is low. The implications — disinformation, phone fraud, blackmail — are obvious.

Responsible systems embed digital watermarks in generated audio, but watermark detection is not perfect, and open-source models generally do not include those protections.

Video generation

Video generation from text or image is the area where progress is fastest and results most impressive, but also where limitations are still most visible.

Sora (OpenAI), announced in 2024, produced videos of up to one minute with convincing visual coherence and camera movement. Runway, Pika, Kling and others offer accessible video generation with results that would have been impossible two years ago.

What works:

Generation of short clips (3–10 seconds) with text prompts
Animating static images
Extending or expanding existing videos
Generating backgrounds or b-roll for production videos

What still fails:

Consistency of objects and people across longer sequences (something appears and disappears, changes shape)
Realistic physics of complex interactions (liquids, clothing, hair)
Reliable generation of visible text in video
Fine control of specific movements

Video avatars. Systems like HeyGen or Synthesia create video presentations in which a photorealistic avatar “speaks” the text you provide. They are already used in corporate training and marketing content. Detectors for this kind of synthetic video are less effective than users assume.

The questions it raises

Voice synthesis and video generation raise questions that go beyond the correct use of a tool:

Consent. Cloning someone’s voice or image without their consent to produce content they never produced is a use that technology makes possible but that raises obvious problems of privacy and dignity.

Detection. Detectors for synthetic audio and video are improving, but they tend to lag behind the generating models. The technological race between generation and detection has no stable winner.

Disinformation. Audio and video deepfakes are already used in political disinformation campaigns. The cost of production has fallen dramatically. Verifying the authenticity of videos is becoming a necessary skill.

The practical advice: use these tools for your own content, with your own voice or with images you hold appropriate licences for, and be transparent about the use of AI when relevant. The questions they raise are not reasons to avoid them, but they are reasons to use them with judgement.
