What's New in ElevenLabs v3

November 21, 2025 · 12 min read

For anyone who has worked with AI-generated audio, the name ElevenLabs likely comes up. Their previous models set a high bar for quality. With the release of Eleven v3, we're seeing a major shift in what is possible. This is not just a minor update with a few new voices; it represents a different approach to how AI handles the spoken word.

Audio example (Emotional Range of ElevenLabs v3): a nuanced, hesitant delivery that showcases v3's performance capabilities.

From Narration to Performance

What is the big change? Previous text-to-speech models, even highly advanced ones, focused on producing clear and natural-sounding speech. They were good at reading. Eleven v3 is built for performance. It interprets the emotional subtext of a script, allowing for a delivery with more character, feeling, and timing. We're moving away from simple narration and into the territory of directed, AI-driven voice acting. The result is more believable character interactions and more engaging long-form content.

What We'll Cover

Over the next few sections, we'll look at the tools v3 provides. We will explore the new Text to Dialogue API, see how 'Audio Tags' give you director-level control over a performance, and go through the settings that help you get the best results. We will also cover the scenarios where v3 really makes a difference, giving you a clear picture of how to use these new capabilities in your own projects.

New Controls for Emotion and Dialogue

The biggest update in v3 is its ability to handle emotional context. This moves the model from being a simple text reader to something more like a voice director. It is less about the clarity of the words and more about the intent behind them. This is managed through a few major new features that give you a high degree of control over the final audio.

A Deeper Emotional Range

With previous models, the emotion was often tied to the chosen voice. If you wanted an energetic delivery, you needed an energetic voice clone. V3 is different. It is designed to generate speech with a wider emotional spectrum. It can pick up on contextual cues in the text to deliver a more nuanced performance.

For example, the sentence, "I can't believe you did that," can be delivered in completely different ways based on the surrounding script.

  • Awe: "That trick was amazing. I can't believe you did that." The delivery will shift to sound impressed and excited.
  • Anger: "I trusted you. I can't believe you did that." The model will add an accusatory, sharp tone.
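
To see this in practice, here is a minimal sketch of how the same line can be generated with two different contexts through the text-to-speech endpoint, using Python and the requests library. The "eleven_v3" model identifier, the placeholder API key, and the placeholder voice ID are illustrative assumptions; check the current API reference for the exact values.

  import requests

  API_KEY = "YOUR_XI_API_KEY"      # your ElevenLabs API key
  VOICE_ID = "YOUR_VOICE_ID"       # any voice from your Voice Library
  URL = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

  # The same sentence framed by two different contexts; the surrounding
  # script is what steers the emotional read.
  scripts = {
      "awe.mp3": "That trick was amazing. I can't believe you did that.",
      "anger.mp3": "I trusted you. I can't believe you did that.",
  }

  for filename, text in scripts.items():
      payload = {"text": text, "model_id": "eleven_v3"}  # assumed v3 model id
      response = requests.post(URL, headers={"xi-api-key": API_KEY}, json=payload)
      response.raise_for_status()
      with open(filename, "wb") as f:
          f.write(response.content)  # the endpoint returns MP3 audio by default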

Creating Conversations with the Text to Dialogue API

A completely new addition is the Text to Dialogue API. This tool is built specifically for creating conversations between multiple characters. Instead of generating single lines of audio and piecing them together, you can now structure a script with different speakers and have v3 generate the entire exchange. This is a major step for anyone producing content with multiple interacting characters, such as podcasts, animated shows, or game dialogue. It allows for a more natural flow and interaction between the voices.

Directing the Voice with Audio Tags

Perhaps the most hands-on new feature is the introduction of Audio Tags. These are simple, bracketed commands you place directly into your script to guide the model's delivery on a moment-to-moment basis. This gives you a level of control that was not possible before.

Let's look at a line of dialogue without any tags: "I'm not sure about this path. It looks steep."

Audio example (without Audio Tags): a simple, flat delivery without emotional direction.

Now, let's add Audio Tags to give the line a specific character and emotional context: "[hesitantly] I'm not sure about this path. [with a gulp] It looks steep."

Audio example (with Audio Tags): a hesitant delivery with emotional nuance.

The result is no longer a simple statement but a moment of character expression. Here are some of the tags you can use.

  • Vocal Cues and Emotions: You can now direct a voice's delivery and emotional state. For example, you can make a character whisper by adding [whispers] or add a reaction like [sighs] or [laughs]. There are also tags for more complex emotions, such as [sarcastic], [curious], or [crying].
  • Atmospheric and Sound Effects: Beyond just vocal direction, you can insert non-verbal sounds into the audio. Think of adding [applause] after a big speech or a [gunshot] for a dramatic moment. These tags help build the world around the dialogue.
  • Experimental and Creative Tags: ElevenLabs has also included some experimental tags that push the creative boundaries. You can try to add a specific accent with [strong X accent] or even make a character sing with [sings]. These may not work consistently across all voices, but they open up interesting possibilities for unique audio creation.
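
Because Audio Tags are plain text, using them through the API requires no extra parameters; they simply live inside the script you send. Here is a minimal sketch along the same lines as before (the voice ID and the "eleven_v3" model identifier remain illustrative assumptions):

  import requests

  API_KEY = "YOUR_XI_API_KEY"
  VOICE_ID = "YOUR_VOICE_ID"
  URL = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

  # The tags sit inline in the script and are read as delivery cues, not spoken aloud.
  tagged_text = "[hesitantly] I'm not sure about this path. [with a gulp] It looks steep."

  payload = {"text": tagged_text, "model_id": "eleven_v3"}  # assumed v3 model id
  response = requests.post(URL, headers={"xi-api-key": API_KEY}, json=payload)
  response.raise_for_status()

  with open("hesitant_path.mp3", "wb") as f:
      f.write(response.content)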

Fine-Tuning Your Audio with Voice Selection and Settings

Getting the right output is a combination of choosing the right voice and adjusting a few key settings. Eleven v3 provides several tools to help you get the exact delivery you are looking for.

The Importance of the Right Voice

In v3, the voice you select is more important than ever. Because the model can generate a wider range of emotions, the base voice acts as a starting point. Its core characteristics will always influence the final performance. Think of it this way: if your chosen voice is naturally calm, asking it to shout with a [shout] tag will probably not produce a convincing result. The model will try, but it is working against the voice's original nature. For this reason, it is a good idea to test a few different voices from the library to find one that aligns with the emotional tone of your project. For now, Instant Voice Clones (IVCs) or pre-designed voices are recommended, as Professional Voice Clones (PVCs) are still being optimized for v3.

The Stability Slider Explained

The most direct way to control the AI's performance is with the stability slider. This setting determines how much freedom the model has to be expressive versus how closely it sticks to the original voice's characteristics.

Let's use this line as a test: "Wow, look at that view, it's incredible!"

  • Creative: At the lower end of the slider, this setting gives the model the most freedom. You will get more emotional and expressive deliveries. The AI might add a sigh or a slight laugh, stretching the word "Wow" for emotional effect.
  • Natural: This is the middle ground. It aims to produce a voice that is balanced and sounds closest to the original voice recording. It offers a good mix of expressiveness and consistency.
  • Robust: At the higher end, this setting makes the voice highly stable and consistent. The delivery will be clear but less responsive to directional prompts like Audio Tags.
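
Here is a sketch of how the stability setting might be passed with a request. Treat the numeric mapping of Creative, Natural, and Robust onto 0.0, 0.5, and 1.0 as an assumption made for illustration; the voice settings documentation is the source of truth for the values v3 actually accepts.

  import requests

  API_KEY = "YOUR_XI_API_KEY"
  VOICE_ID = "YOUR_VOICE_ID"
  URL = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

  # Assumed mapping of the three presets onto the stability value.
  STABILITY = {"creative": 0.0, "natural": 0.5, "robust": 1.0}

  payload = {
      "text": "Wow, look at that view, it's incredible!",
      "model_id": "eleven_v3",                                # assumed v3 model id
      "voice_settings": {"stability": STABILITY["creative"]},
  }

  response = requests.post(URL, headers={"xi-api-key": API_KEY}, json=payload)
  response.raise_for_status()

  with open("view_creative.mp3", "wb") as f:
      f.write(response.content)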

The Impact of Punctuation

A subtle but effective way to influence the delivery is through your use of punctuation. V3 pays close attention to these small details in your script. For instance, compare these two sentences:

Wait what are you doing

Audio example (without punctuation): a rushed, flat delivery.

Wait... what are you DOING?

Audio example (with punctuation and emphasis): a pause, confusion, and urgent emphasis.

The first version sounds rushed and flat. The second uses an ellipsis (...) to create a realistic pause of confusion and capitalization to add urgent emphasis on the final word, completely changing the feel of the line. Standard punctuation, like commas and periods, also helps create a natural rhythm and cadence to the speech.

Crafting Conversations with Multi-Speaker Dialogue

One of the most practical new features in Eleven v3 is its native ability to handle multi-speaker conversations. Before, creating a dialogue meant generating each line as a separate audio file and then editing them together. The new Text to Dialogue API lets you generate a complete, flowing conversation in a single pass.

Bringing Characters to Life

The process is straightforward. You structure your script to assign different voices from your library to each character. The model then generates the dialogue, taking into account the context of the conversation.

Here is a short script formatted for the Text to Dialogue API:

[voice: Anna]
Okay, I think I've packed everything. Compass, map, water... [sighs] this backpack is heavy.

[voice: Leo]
[laughs] You say that every time. Did you remember the extra batteries for the flashlight?

[voice: Anna]
They're in the side pocket. Don't worry, I'm prepared for anything this time.

Listen to each character's line:

  • Anna - Opening Line: natural delivery with implied exhaustion
  • Leo - Response: playful delivery with a laugh
  • Anna - Follow-up: confident, prepared tone

When generated, this becomes a single audio file. The API creates a cohesive scene where Anna's sigh and Leo's laugh sound like natural reactions within the flow of their conversation. This capability is a major benefit for anyone creating narrative-driven content.
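
As a sketch, the hiking script above could be submitted in a single request along these lines. The endpoint path, the "inputs" payload shape, and the voice IDs are assumptions made for illustration; consult the Text to Dialogue API reference for the exact request format.

  import requests

  API_KEY = "YOUR_XI_API_KEY"
  ANNA_VOICE_ID = "VOICE_ID_FOR_ANNA"   # hypothetical IDs for two library voices
  LEO_VOICE_ID = "VOICE_ID_FOR_LEO"

  URL = "https://api.elevenlabs.io/v1/text-to-dialogue"      # assumed endpoint path
  payload = {
      "model_id": "eleven_v3",                               # assumed v3 model id
      "inputs": [
          {"voice_id": ANNA_VOICE_ID,
           "text": "Okay, I think I've packed everything. Compass, map, water... [sighs] this backpack is heavy."},
          {"voice_id": LEO_VOICE_ID,
           "text": "[laughs] You say that every time. Did you remember the extra batteries for the flashlight?"},
          {"voice_id": ANNA_VOICE_ID,
           "text": "They're in the side pocket. Don't worry, I'm prepared for anything this time."},
      ],
  }

  response = requests.post(URL, headers={"xi-api-key": API_KEY}, json=payload)
  response.raise_for_status()

  with open("hiking_scene.mp3", "wb") as f:
      f.write(response.content)  # the whole exchange comes back as one audio file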

  • Audio Dramas and Podcasts: Easily produce scenes with multiple characters interacting, complete with emotional nuance and realistic timing.
  • Game Development: Generate in-game conversations and cutscenes with distinct, believable character voices.
  • Animations and Film Pre-visualization: Create scratch tracks and animatics with a full cast of voices, helping to establish the tone and pacing of a scene early in production.

Ideal Use Cases and Applications for v3

With its focus on emotional range and multi-speaker conversations, Eleven v3 is a specialized tool. It is designed for projects where performance and believability are top priorities. While other models like Flash v2.5 are built for speed and real-time applications, v3 is at its best when given the space to craft more complex and nuanced audio.

Character Interactions

V3 is highly suitable for any medium that relies on characters interacting with each other. Developers and animators can generate high-quality voiceovers for entire scenes. This is useful for everything from final in-game dialogue to creating scratch tracks for animatics. For podcasts and audio dramas, creators can produce full-cast audio experiences with rich, immersive soundscapes.

The Next Generation of Audiobooks

Long-form narration is another area where v3 offers a major step forward. Audiobook production requires a narrator to maintain a consistent voice while also conveying a wide range of emotions and character voices. V3's ability to interpret and deliver on emotional context makes it a useful tool for this task. It can handle subtle shifts in tone, from a moment of quiet reflection to a tense action sequence.

Believable Dialogue for Any Medium

Beyond entertainment, v3 has practical applications in any area where human-like speech can make content more effective. Corporate videos and training materials often suffer from robotic, monotone narration. Using v3 to create more natural and emotionally varied dialogue can make this content more engaging. Writers and directors can also hear their scripts performed with emotional weight early in the creative process, helping them refine dialogue and check the pacing of a scene.

The Technical Side of v3

It is also helpful to know a few technical details. These points will give you a clearer picture of what to expect when you start working with the model, especially since it is still in an early stage of development.

The "Alpha" Stage Explained

ElevenLabs has released v3 as an "alpha," which is another way of saying it is a research preview. This has a few practical implications for users.

  • It is Subject to Change: The model is still being developed. Features could be adjusted, added, or removed in future updates.
  • Possible Inconsistency: Because it is an alpha, you might find that the output can be inconsistent at times. The official documentation notes that very short prompts are more likely to produce variable results and suggests using prompts longer than 250 characters.
  • Not for Real-Time Use: V3 is not designed for applications that require instant audio generation, like live voice agents. For those use cases, models like Flash v2.5 remain the better choice.

A Note on Voice Cloning

If you have used Professional Voice Clones (PVCs) with previous ElevenLabs models, there is an important distinction to make with v3. PVCs are not yet fully optimized for this new model, which can lead to a lower quality of voice cloning. For now, the recommendation is to use either Instant Voice Clones (IVCs) or one of the pre-designed voices from the Voice Library when working with v3.

Expanded Language Support

A major technical improvement in v3 is the number of languages it supports. The model can now generate speech in over 70 languages. This is a sizeable increase from the 29 languages supported by the Multilingual v2 model and a major advantage for creators working on multilingual projects.

Practical Tips for Success with v3

Getting the hang of Eleven v3 is less about following a rigid set of rules and more about learning to direct the AI. A bit of thoughtful prompting and experimentation can make a major difference in the quality of your final audio.

The Art of the Prompt

With v3, the text you provide does more than just supply the words; it gives the model context. Longer passages give the model a better sense of the emotional arc and pacing, which generally leads to more stable outputs. The official advice is to use prompts of 250 characters or more. For the most natural-sounding delivery, write your text the way someone would actually speak it.

Layering Audio Tags for Nuance

Audio Tags can be combined to create more complex and layered performances. You can mix emotional states with physical actions to achieve a more specific delivery. For example, instead of just using [nervously], you could try:

"[nervously] I... I don't think this is a good idea. [gulps] Let's just go back."

This combination tells the model to deliver the line with a nervous tone and add a physical swallowing sound, creating a more vivid and believable moment of hesitation.

Match the Direction to the Voice

This point is worth repeating: the base voice you choose sets the stage for the entire performance. You will get the best results when your directions align with the natural characteristics of the voice. If you have selected a voice that is inherently serious and professional, asking it to deliver a line with a [giggles] tag might sound forced. A better approach is to choose a voice that has a more lighthearted quality to begin with.

What's Next for ElevenLabs and AI Voice

The release of Eleven v3 marks a clear direction for the future of AI-generated audio. It is a move away from simply replicating human speech and toward the more complex goal of simulating human performance.

The Road Ahead for v3

Because Eleven v3 is still in a research preview stage, we can expect it to continue changing and improving. The team at ElevenLabs has already noted that optimizations for Professional Voice Clones (PVCs) are on the way. Beyond official updates, the capabilities of v3 will likely expand as the community experiments with it. New and effective combinations of Audio Tags and prompting techniques will almost certainly be discovered by users.

The Broader Impact on Audio Creation

Looking at the bigger picture, tools like v3 are changing the landscape for content creators. The ability to direct an AI performance with a high degree of emotional control lowers the barrier for producing quality audio. Independent game developers, animators, and podcast producers can now access tools that were once only available to large studios with sizeable budgets. As these tools continue to develop, the distinction between a human performance and an AI-generated one will likely become even less clear.
