ElevenLabs Cheat Sheet

Current quick reference for the ElevenLabs platform in 2026

This cheat sheet covers the current ElevenLabs stack: text-to-speech (TTS), model selection, voice library and cloning, API usage, streaming, language support, and where agents and speech-to-text fit into the platform.

Reading time: ≈ 7 minutes
Last updated: March 30, 2026

Current ElevenLabs Platform

ElevenLabs is no longer just a text-to-speech tool. The platform now spans TTS, speech-to-text, voices, agents, and other audio workflows.

Text to Speech

Core TTS models include eleven_flash_v2_5, eleven_multilingual_v2, and eleven_v3.

Speech to Text

Scribe v2 and realtime STT support transcription, timestamps, captions, and meeting-style audio workflows.

Voices

Use Voice Library, Instant Voice Cloning, Professional Voice Cloning, and Voice Design depending on speed, quality, and sharing needs.

Agents

ElevenLabs now supports agent-style, low-latency voice experiences where TTS and realtime audio matter more than one-off generation.

Music & Sound

The platform also covers music and sound effects, so ElevenLabs now sits closer to a broader audio generation stack than a TTS-only API.

Dubbing & Cleanup

Dubbing, voice isolation, and related audio cleanup workflows are now part of the product surface many users expect from ElevenLabs.

What changed since older cheat sheets?

  • Older v1-era references are outdated.
  • Turbo-first recommendations are no longer the best default framing.
  • The product is broader now: TTS, STT, voices, agents, and audio workflows.

Quick Start Guide

Get results in 5 minutes

Step 1: Complete Working Example

Copy this cURL command, replace YOUR_API_KEY with your own key, and run it to generate your first audio file:

curl -X POST https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM \
  -H "xi-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Welcome to ElevenLabs. This is your first generated voice.",
    "model_id": "eleven_flash_v2_5",
    "voice_settings": {
      "stability": 0.5,
      "similarity_boost": 0.75
    }
  }' \
  --output speech.mp3
Voice ID used: 21m00Tcm4TlvDq8ikWAM (Rachel, the default female voice)
Get your API key: elevenlabs.io → Developer → API Key
Step 2: Choose Your Model

  • Need emotion & drama? → eleven_v3 (dialogue, characters, storytelling)
  • Need speed (real-time)? → eleven_flash_v2_5 (chatbots, live agents, ~75ms latency)
  • Need quality & languages? → eleven_multilingual_v2 (29 languages, premium quality)
  • Need a safe default? → eleven_flash_v2_5 for speed, eleven_multilingual_v2 for polished narration
Step 3: Tune Voice Settings

Stability (0.0 - 1.0)
  • Low (0.3): varied & expressive
  • High (0.8): consistent & predictable
Similarity Boost (0.0 - 1.0)
  • Higher = closer to the original voice
  • Sweet spot: 0.75 for most cases
Style Exaggeration (0.0 - 1.0)
  • Amplifies the voice's natural style
  • Start at 0.0, increase only if needed
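These three sliders map directly onto the `voice_settings` object in API requests. A minimal Python sketch of assembling a request body; the clamping helper and defaults are illustrative, not part of any official SDK:

```python
def clamp(value, lo=0.0, hi=1.0):
    """Keep a setting inside the valid 0.0 - 1.0 range."""
    return max(lo, min(hi, value))

def build_tts_payload(text, model_id="eleven_flash_v2_5",
                      stability=0.5, similarity_boost=0.75, style=0.0):
    """Assemble a TTS request body with sanity-checked voice settings."""
    return {
        "text": text,
        "model_id": model_id,
        "voice_settings": {
            "stability": clamp(stability),
            "similarity_boost": clamp(similarity_boost),
            "style": clamp(style),
        },
    }

payload = build_tts_payload("Hello world", stability=1.4)
# out-of-range stability is clamped back to 1.0
```

POST this dict as JSON to the text-to-speech endpoint shown in the API Quick Reference section.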

Models Comparison

Choose the right model for your use case

Flash v2.5 (eleven_flash_v2_5)
  Best for: Real-time applications — chatbots, live agents, conversational AI
  Latency: ~75ms (ultra-fast)
  Languages: 32
  Char limit: 40,000 (~40 min audio)
  Key features: lowest latency • optimized for speed • WebSocket streaming • good quality/speed balance

Multilingual v2 (eleven_multilingual_v2)
  Best for: Premium quality — audiobooks, podcasts, videos
  Latency: ~1-2s (slower)
  Languages: 29 (premium quality)
  Char limit: 10,000 (longer-form narration)
  Key features: highest quality • best for long-form content • natural prosody • accent preservation

Eleven v3 (eleven_v3)
  Best for: Emotional & expressive — storytelling, character voices, drama
  Latency: ~1-2s (slower)
  Languages: 70+ (most supported)
  Char limit: 5,000 (designed for expressive output)
  Key features: multi-speaker dialogue • audio tags [laughs] [whispers] • most emotional depth • character acting

Scribe v2 (scribe_v2)
  Best for: Speech-to-text — transcription, timestamps, diarization
  Latency: ~150ms (realtime variant available)
  Languages: 90+ (most coverage)
  Input: audio files (no character limit)
  Key features: converts speech to text • word-level timestamps • speaker diarization • high accuracy
  • Need speed? Real-time, conversational AI, low latency → Flash v2.5
  • Need quality? Audiobooks, premium multilingual content, natural prosody → Multilingual v2
  • Need emotion? Drama, character voices, expressive storytelling → Eleven v3
  • Need STT? Transcription, realtime captions, meeting audio → Scribe v2 / Realtime
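This decision guide fits in a small lookup table. A sketch; the need labels are arbitrary, while the model IDs are the ones listed in this sheet:

```python
MODEL_FOR_NEED = {
    "speed": "eleven_flash_v2_5",         # real-time, lowest latency
    "quality": "eleven_multilingual_v2",  # polished multilingual narration
    "emotion": "eleven_v3",               # expressive dialogue and drama
    "stt": "scribe_v2",                   # transcription, not TTS
}

def pick_model(need):
    """Return a model ID for a need; fall back to the safe default."""
    return MODEL_FOR_NEED.get(need, "eleven_flash_v2_5")
```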

Text Control & Formatting

Control timing, pronunciation, and speech patterns

Pauses & Timing

Break Tags (up to 3 seconds)
<break time="1.5s" />
Insert natural pauses between sentences
Combine with Sentence Endings
Sentence. <break time="2s" /> Next.
Place break tags after sentence-ending punctuation for cleaner transitions.
Warning: too many breaks can cause output instability.

Pronunciation Control

CMU Arpabet (Recommended)
<phoneme alphabet="cmu-arpabet" ph="M AE1 D IH0 S AH0 N">Madison</phoneme>
IPA (International)
<phoneme alphabet="ipa" ph="ˈæktʃuəli">actually</phoneme>
Compatible: Flash v2, Turbo v2, English v1

Alias Tags

For Other Models (v3, Multilingual)
<lexeme> <grapheme>UN</grapheme> <alias>United Nations</alias> </lexeme>
  • Case-sensitive: exact matching
  • Context-aware: understands usage
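Both tag styles are plain strings, so small helpers can keep the markup consistent across a script. A sketch; the helper names are made up for illustration:

```python
def phoneme_tag(word, ph, alphabet="cmu-arpabet"):
    """Wrap a word in a <phoneme> tag (Flash v2, Turbo v2, English v1)."""
    return f'<phoneme alphabet="{alphabet}" ph="{ph}">{word}</phoneme>'

def alias_tag(grapheme, alias):
    """Wrap an abbreviation in a <lexeme> alias (v3, Multilingual)."""
    return (f"<lexeme> <grapheme>{grapheme}</grapheme> "
            f"<alias>{alias}</alias> </lexeme>")

# Build text with a corrected pronunciation inline:
text = f"{phoneme_tag('Madison', 'M AE1 D IH0 S AH0 N')} works at the UN."
```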

Speed & Pacing

Speed Range: 0.7 - 1.2
  • Slow (0.7): deliberate narration
  • Normal (1.0): default pace
  • Fast (1.2): quick delivery
Prefer narrative style in the text for natural pacing instead of extreme speed values.

Pronunciation Dictionaries (.PLS)

What They Are
  • Upload custom pronunciation files
  • Reusable across multiple requests
  • Combine phonemes + aliases
  • Case-sensitive matching
Useful Tools
Sequitur G2P
Grapheme-to-phoneme conversion
Phonetisaurus
Phoneme sequence modeling
eSpeak
Text-to-phoneme synthesis

Emotion & Expression

Make AI voices sound more human and expressive without overfitting to fragile prompt tricks

Narrative Context

Add emotional context around dialogue for natural expression

Sadness Example
"You're leaving?" she asked, her voice trembling with sadness.
Triumph Example
"That's it!" he exclaimed triumphantly.
Works with all models - most reliable method for emotion

v3 Audio Tags

Special tags for Eleven v3 model only

Voice Tags
[laughs] [whispers] [sighs] [exhales]
[sarcastic] [curious] [excited] [crying]
Expressive Tags
[applause] [clapping] [gasps]

Special Tags & Effects

Accent Control
[strong French accent] Bonjour!
Useful for experimentation, but validate results per voice and model.
Other Special Tags
[sings] - Singing voice

Punctuation Techniques

CAPITALS for Emphasis
This is AMAZING!
Adds stress and volume
Ellipses for Weight
I don't know... maybe...
Creates hesitation and pauses
Question Marks
You did what?
Natural upward inflection

Eleven v3 Mode Settings

Choose the right mode for your emotional needs

Creative
• Most emotional
• Prone to hallucinations
• Best for dramatic content
Use for: Character acting, storytelling
Natural
• Balanced approach
• Closest to original voice
• Good responsiveness
Use for: General purpose, audiobooks
Robust
• Very stable output
• Less responsive to prompts
• Consistent quality
Use for: Technical content, narration

Voice Selection & Settings

Choose the right voice source, clone type, and settings for your use case

Current ElevenLabs Voice Stack

Voice Library
Browse marketplace and prebuilt voices for fast testing.
IVC
Instant Voice Cloning for fast turnaround from a small sample.
PVC
Professional Voice Cloning for the highest quality and shareable clones.
Voice Design
Create a new synthetic voice instead of cloning an existing speaker.

Voice Library

Pre-built Collections
Browse prebuilt and community voices in the Voice Library marketplace
Emotionally diverse range
Professional quality
Ready to use
Popular Voices
Rachel - Neutral female
Adam - Deep male
Bella - Soft female
Tip: use the library for rapid testing before investing in a custom clone.

Voice Cloning Types

IVC (Instant Voice Cloning)
Upload 1-2 minutes of audio, instant results
Best for quick prototyping and internal testing
PVC (Professional Voice Cloning)
3+ hours of audio, highest quality
Studio-grade quality
Longer processing time
Shareability
Professional Voice Clones can be shared in the Voice Library; treat IVC as a faster, lighter path.

Voice Settings Presets

Narration
  Stability: 0.6-0.8 | Similarity: 0.7 | Style: 0.2-0.4
  Audiobooks, documentation, training content
Customer Support
  Stability: 0.7-0.9 | Similarity: 0.9 | Style: 0.0-0.2
  Clear, consistent voice for customer communication
Characters
  Stability: 0.3-0.5 | Similarity: 0.8 | Style: 0.7-1.0
  Dramatic, highly expressive character voices
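These presets translate directly into reusable `voice_settings` dicts. A sketch using the midpoints of each suggested range; tune exact values per voice rather than treating these as canonical:

```python
PRESETS = {
    "narration":        {"stability": 0.70, "similarity_boost": 0.70, "style": 0.30},
    "customer_support": {"stability": 0.80, "similarity_boost": 0.90, "style": 0.10},
    "characters":       {"stability": 0.40, "similarity_boost": 0.80, "style": 0.85},
}

def settings_for(use_case):
    """Return a copy of a preset so callers can tweak it safely."""
    return dict(PRESETS[use_case])
```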

Speaker Boost

What It Does
Enhances voice clarity and presence, reduces artifacts
When to Use
Professional recordings
Audiobooks & podcasts
Not for real-time (adds latency)

Selection Tips

Test Multiple Voices
Same text sounds different across voices
Match Content Type
Formal content needs neutral voices
Consider Audience
Age, region, and preferences matter
Regenerate Feature
Generate a few takes before locking a voice for production

WebSocket Streaming

Real-time audio generation patterns for low-latency ElevenLabs workflows

Connection Setup

WebSocket Endpoint
wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input
Required Headers
xi-api-key: YOUR_API_KEY
Use Flash v2.5 for best streaming performance. Confirm endpoint parameters against current docs before production rollout.

Message Protocol

Send Configuration (JSON)
{
  "text": "Hello world",
  "model_id": "eleven_flash_v2_5",
  "voice_settings": { "stability": 0.5, "similarity_boost": 0.75 }
}
Receive Audio Chunks
Audio chunks stream back incrementally; exact framing and formats can vary by integration.

Chunk Management

Send Text in Chunks
ws.send(JSON.stringify({ text: "First chunk. " }))
ws.send(JSON.stringify({ text: "Second chunk." }))
Stream text incrementally for faster response, ideally at phrase or sentence boundaries.
Chunking Guidance
  • Smaller chunks: lower latency
  • Sentence-sized chunks: best balance
  • Very large chunks: higher latency
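A sentence-boundary chunker keeps streamed text inside the sweet spot above. A sketch; the regex only handles simple `.`/`!`/`?` endings, so abbreviations will over-split, and a single sentence longer than the limit is passed through whole:

```python
import re

def sentence_chunks(text, max_len=500):
    """Split text at sentence boundaries into chunks of at most max_len chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_len:
            chunks.append(current)   # flush the full chunk
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk can then be sent as one `ws.send` text message.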

Flushing & Completion

Flush Message
ws.send(JSON.stringify({ flush: true }))
Flush any remaining buffered text for a clean ending.
Completion Signal
Watch for the service completion event defined in the current WebSocket docs.

Buffering Strategy

Initial Buffer
Wait ~250ms before playback to avoid stutter
Rolling Buffer
Always keep 150-200ms of audio ahead
Backpressure
Pause sending text if playback lags

Error Handling

Common Errors
  • 401 - Invalid API key
  • 429 - Rate limited
  • 400 - Invalid chunk format
Retries
Exponential backoff (250ms → 4s)
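The retry schedule above (250 ms, doubling up to a 4 s cap) is easy to precompute. A sketch; the helper names are illustrative:

```python
import time

def backoff_delays(base=0.25, cap=4.0, attempts=6):
    """Exponential backoff schedule: 0.25s, 0.5s, 1s, 2s, 4s, 4s, ..."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]

def with_retries(fn, attempts=6):
    """Call fn, sleeping per the schedule after each failure."""
    last_error = None
    for delay in backoff_delays(attempts=attempts):
        try:
            return fn()
        except Exception as err:  # in practice, retry only on 429/5xx
            last_error = err
            time.sleep(delay)
    raise last_error
```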

Performance Tips

Minimize Latency
  • Use Flash v2.5 model
  • Send 200-500 char chunks
  • Flush at sentence breaks
  • Keep connection alive
Optimize Quality
  • Buffer 300-500ms initially
  • Use complete sentences
  • Avoid mid-word chunks
  • Proper punctuation
Handle Errors
  • Implement reconnection
  • Monitor connection health
  • Handle rate limits
  • Log failed chunks

API Quick Reference

Essential endpoints, model IDs, and current usage patterns

Text-to-Speech (POST)

Endpoint
POST https://api.elevenlabs.io/v1/text-to-speech/{voice_id}
Required Headers
xi-api-key: YOUR_API_KEY
Content-Type: application/json
Request Body
{
  "text": "Your text here",
  "model_id": "eleven_flash_v2_5",
  "voice_settings": {
    "stability": 0.5,
    "similarity_boost": 0.75,
    "style": 0.0,
    "use_speaker_boost": true
  }
}

Python Example

import requests

API_KEY = "YOUR_API_KEY"
voice_id = "21m00Tcm4TlvDq8ikWAM"  # Rachel (default female voice)

url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

headers = {
    "xi-api-key": API_KEY,
    "Content-Type": "application/json"
}

data = {
    "text": "Hello world",
    "model_id": "eleven_flash_v2_5"
}

response = requests.post(url, json=data, headers=headers)
response.raise_for_status()  # fail fast on auth, quota, or validation errors

with open("output.mp3", "wb") as f:
    f.write(response.content)

JavaScript Example

const url = 'https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM';
const API_KEY = 'YOUR_API_KEY';

const response = await fetch(url, {
  method: 'POST',
  headers: {
    'xi-api-key': API_KEY,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ text: 'Hello world', model_id: 'eleven_flash_v2_5' })
});

const audioBuffer = await response.arrayBuffer();
const blob = new Blob([audioBuffer], { type: 'audio/mpeg' });
const urlObject = URL.createObjectURL(blob);

List Voices (GET)

GET https://api.elevenlabs.io/v1/voices
Returns all available voices with metadata

List Models (GET)

GET https://api.elevenlabs.io/v1/models
See available model IDs and capabilities

Speech-to-Text (POST)

POST https://api.elevenlabs.io/v1/speech-to-text
Use Scribe v2 for transcription, timestamps, and diarization workflows.
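A minimal request sketch for this endpoint, assuming it accepts multipart form data with `model_id` and `file` fields; confirm the field names against the current speech-to-text reference before relying on them:

```python
STT_URL = "https://api.elevenlabs.io/v1/speech-to-text"

def build_stt_request(api_key, model_id="scribe_v2"):
    """Headers and form fields for a transcription request."""
    headers = {"xi-api-key": api_key}
    data = {"model_id": model_id}
    return headers, data

# Usage with the requests library (needs a real key and audio file):
# import requests
# headers, data = build_stt_request("YOUR_API_KEY")
# with open("meeting.wav", "rb") as f:
#     resp = requests.post(STT_URL, headers=headers, data=data,
#                          files={"file": f})
# print(resp.json())
```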

Voice Cloning (POST)

POST https://api.elevenlabs.io/v1/voices/add
  • Upload audio files
  • Provide descriptive name
  • Choose cloning type (IVC/PVC)
  • Optional: description, labels

Current Model IDs to Know

Fast TTS
eleven_flash_v2_5
Premium TTS
eleven_multilingual_v2
Expressive TTS
eleven_v3

Language Support

Current language coverage depends on model, with Eleven v3 offering the broadest TTS support

Language Coverage by Model

Flash v2.5
32
Major global languages
Multilingual v2
29
Premium quality support
Eleven v3
70+
Most comprehensive

Language Codes

{
  "text": "Hola, ¿cómo estás?",
  "model_id": "eleven_multilingual_v2",
  "language_code": "es"
}
Set language_code when the endpoint and model support it for more predictable output.

Multilingual Tips

Consistency
Keep punctuation style consistent with the language
Context
Provide regional context (e.g., Mexico vs Spain)
Accents
Prefer voice choice and clean localized text over heavy prompt tricks

Regional Variants

  • Spanish: es-ES, es-MX, es-AR
  • Portuguese: pt-PT, pt-BR
  • English: en-US, en-GB, en-AU
  • French: fr-FR, fr-CA

Code Switching

"Welcome to Mexico City. Bienvenidos a la Ciudad de México."
Use full sentences per language for best effect

Model-Specific Notes

Flash v2.5
  • Best for realtime TTS
  • 32 languages
  • Strong default for agents
  • Use for low-latency playback
Multilingual v2
  • Best for polished multilingual narration
  • 29 languages
  • Better long-form balance
  • Use when quality matters more than speed
Eleven v3
  • Broadest TTS language coverage
  • 70+ languages
  • Best expressive model
  • Use for dialogue and emotional delivery

Special Characters

Accented Characters
á é í ó ú ñ ç ü ö ä
Cyrillic Script
Russian, Ukrainian, Bulgarian
Right-to-Left
Arabic, Hebrew support
Diacritics
Proper pronunciation handling

Automatic Detection

How It Works
Some workflows can infer language automatically, but explicit configuration is safer.
  • Works best for common, clean inputs
  • Less accurate than an explicit code
  • Prefer setting language_code when available

Troubleshooting

Common issues and solutions

Audio Quality Issues

Robotic/Unnatural Sound
  • Lower stability (try 0.3-0.5)
  • Increase similarity boost
  • Add narrative context
  • Try different voice
Inconsistent Output
  • Increase stability (0.7-0.8)
  • Switch from v3 to Flash or Multilingual for more stable output
  • Remove excessive breaks
  • Simplify text formatting
Artifacts/Glitches
  • Enable speaker boost
  • Reduce style exaggeration
  • Fix text punctuation
  • Regenerate (use 3x feature)

Voice Cloning Problems

Poor Voice Match
  • Use 2+ minutes of audio
  • Ensure clean recording
  • Single speaker only
  • No background noise
Clone Sounds Different
  • Match audio style to use case
  • Adjust similarity boost
  • Try PVC instead of IVC
  • Test clone quality across current TTS models
Upload Fails
  • Check file format (MP3/WAV)
  • Max 10 files per upload
  • 1-2 min total duration
  • Remove silence/pauses

Pronunciation Issues

Mispronounced Names
  • Use phoneme tags
  • Add pronunciation dictionary
  • Provide phonetic hints
  • Spell out syllables
Foreign Words
  • Set language_code
  • Use narrative context
  • Try accent tags
  • Break into syllables

Latency Issues

Slow Responses
  • Use Flash v2.5
  • Shorten text chunks
  • Avoid complex tags
  • Use WebSocket streaming
Queue Delays
  • Retry after delay
  • Check status page
  • Check plan limits and current service guidance
  • Batch non-urgent jobs

Emotion Not Working

Flat Delivery
  • Add contextual cues
  • Use narrative descriptions
  • Lower stability
  • Use v3 tags
Overacting
  • Increase stability
  • Reduce style exaggeration
  • Use calmer language
  • Switch to Flash or Multilingual for more controlled output

Character Limit Issues

Text Too Long
  • Flash v2.5: 40k limit
  • Multilingual v2: 10k limit
  • Eleven v3: 5k limit
  • Split by sentences
Streaming Overflows
  • Send 200-500 char chunks
  • Flush after sections
  • Use queue for long form
  • Store progress markers
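The per-model limits above can be enforced before sending. A sketch that splits at sentence boundaries; a single sentence longer than the limit is passed through unsplit, and the limits are the ones listed in this sheet:

```python
import re

CHAR_LIMITS = {
    "eleven_flash_v2_5": 40_000,
    "eleven_multilingual_v2": 10_000,
    "eleven_v3": 5_000,
}

def split_for_model(text, model_id):
    """Split text into pieces under the model's character limit."""
    limit = CHAR_LIMITS[model_id]
    pieces, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if current and len(current) + len(sentence) + 1 > limit:
            pieces.append(current)   # flush a piece at the limit
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        pieces.append(current)
    return pieces
```

Send each piece as its own request, then concatenate the resulting audio files in order.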

ElevenLabs FAQ

Fast answers to common ElevenLabs questions people search before using the API or picking a model.

What is the best ElevenLabs model right now?

It depends on the job: use Flash v2.5 for low latency, Multilingual v2 for polished multilingual narration, and Eleven v3 for expressive dialogue or character-style output.

Is ElevenLabs only for text to speech?

No. ElevenLabs now spans text-to-speech, speech-to-text, voice cloning, voice design, streaming, agents, dubbing, and other audio workflows.

Which ElevenLabs model should I use for real-time apps?

Start with eleven_flash_v2_5 for low-latency generation, especially when building chat, assistant, or agent-style voice interfaces.

What is the difference between IVC and PVC?

Instant Voice Cloning is faster and lighter for quick tests, while Professional Voice Cloning is the higher-quality path for serious production workflows and shareable cloned voices.

Does ElevenLabs support multiple languages?

Yes. Coverage varies by model, with Eleven v3 offering the broadest TTS language support, while Multilingual v2 remains a strong choice for high-quality multilingual narration.