ElevenLabs Cheat Sheet

Complete quick reference guide to AI voice generation

This comprehensive cheat sheet covers everything you need to know about ElevenLabs: text-to-speech (TTS), voice cloning, API usage, model selection, and advanced techniques for creating natural-sounding AI voices.

Reading time: ≈ 7 minutes
Last updated: November 11, 2025

Quick Start Guide

Get results in 5 minutes

1. Complete Working Example

Copy this cURL command, replace YOUR_API_KEY with your key, and run it to generate your first voice:

curl -X POST https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM \
  -H "xi-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Welcome to ElevenLabs. This is your first generated voice.",
    "model_id": "eleven_flash_v2_5",
    "voice_settings": {
      "stability": 0.5,
      "similarity_boost": 0.75
    }
  }' \
  --output speech.mp3
Voice ID used: 21m00Tcm4TlvDq8ikWAM (Rachel - default female voice)
Get your API key: elevenlabs.io → Developer → API Key
2. Choose Your Model

Need emotion & drama?
eleven_v3
Dialogue, characters, storytelling
Need speed (real-time)?
eleven_flash_v2_5
Chatbots, live agents (75ms latency)
Need quality & languages?
eleven_multilingual_v2
29 languages, premium quality
Need a balanced all-rounder?
eleven_turbo_v2_5
General purpose, narration, IVR
3. Tune Voice Settings

Stability (0.0 - 1.0)
Low (0.3): varied & expressive
High (0.8): consistent & predictable
Similarity Boost (0.0 - 1.0)
Higher = closer to the original voice
Sweet spot: 0.75 for most cases
Style Exaggeration (0.0 - 1.0)
Amplifies the voice's natural style
Start at 0.0, increase if needed
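
These three sliders map directly onto the voice_settings object in the API request. A minimal Python sketch using the same endpoint and voice as the cURL example above (the style field name matches the request body shown later in the API Quick Reference):

import requests

payload = {
    "text": "Tuning example.",
    "model_id": "eleven_flash_v2_5",
    "voice_settings": {
        "stability": 0.5,          # lower = more varied, higher = more consistent
        "similarity_boost": 0.75,  # closeness to the original voice
        "style": 0.0,              # style exaggeration; raise only if needed
    },
}

response = requests.post(
    "https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM",
    headers={"xi-api-key": "YOUR_API_KEY", "Content-Type": "application/json"},
    json=payload,
)
response.raise_for_status()
with open("tuned.mp3", "wb") as f:
    f.write(response.content)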

Models Comparison

Choose the right model for your use case

Flash v2.5 (eleven_flash_v2_5)
Best for: Real-time applications (chatbots, live agents, conversational AI)
Latency: ~75ms (ultra-fast)
Languages: 32
Char limit: 40,000 (~40 min of audio)
  • Lowest latency
  • Optimized for speed
  • WebSocket streaming
  • Good quality/speed balance

Turbo v2.5 (eleven_turbo_v2_5)
Best for: Balanced use cases (general purpose, narration, IVR)
Latency: ~300ms (fast)
Languages: 32
Char limit: 40,000 (~40 min of audio)
  • Better quality than Flash
  • Still fast enough for real-time
  • More emotional range
  • Best all-rounder

Multilingual v2 (eleven_multilingual_v2)
Best for: Premium quality (audiobooks, podcasts, videos)
Latency: ~1-2s (slower)
Languages: 29 (premium quality)
Char limit: 5,000 (~5 min of audio)
  • Highest quality
  • Best for long-form content
  • Natural prosody
  • Accent preservation

Eleven v3 (eleven_v3)
Best for: Emotional & expressive (storytelling, character voices, drama)
Latency: ~1-2s (slower)
Languages: 70+ (most supported)
Char limit: 3,000 (~3 min of audio)
  • Multi-speaker dialogue
  • Audio tags [laughs] [whispers]
  • Most emotional depth
  • Character acting

Scribe v1 (scribe_v1)
Best for: Speech-to-text (transcription, voice cloning prep)
Latency: variable (depends on audio length)
Languages: 99 (most coverage)
Input: audio file based
  • Converts speech to text
  • Supports most languages
  • Use before voice cloning
  • High accuracy
Need Speed?
Real-time, conversational AI, low latency required
→ Flash v2.5
Need Quality?
Audiobooks, premium content, natural prosody
→ Multilingual v2
Need Emotion?
Drama, character voices, expressive storytelling
→ Eleven v3
Need Balance?
General purpose, good quality and speed
→ Turbo v2.5
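
To confirm which of these model IDs are available to your account, the models endpoint (covered in the API Quick Reference below) returns the catalogue programmatically. A minimal Python sketch; the model_id and name fields are what I expect in the response, so verify against the live API:

import requests

response = requests.get(
    "https://api.elevenlabs.io/v1/models",
    headers={"xi-api-key": "YOUR_API_KEY"},
)
response.raise_for_status()

# Print each model's ID so you can copy it straight into a TTS request.
for model in response.json():
    print(model["model_id"], "-", model.get("name", ""))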

Text Control & Formatting

Control timing, pronunciation, and speech patterns

Pauses & Timing

Break Tags (up to 3 seconds)
<break time="1.5s" /> inserts a natural pause between sentences
Dots for Endpoints
Sentence. Next.
Combine break tags with sentence-ending periods for cleaner breaks
Warning: too many breaks can make output unstable

Pronunciation Control

CMU Arpabet (Recommended)
<phoneme alphabet="cmu-arpabet" ph="M AE1 D IH0 S AH0 N">Madison</phoneme>
IPA (International)
<phoneme alphabet="ipa" ph="ˈæktʃuəli">actually</phoneme>
Compatible models: Flash v2, Turbo v2, English v1

Alias Tags

For Other Models (v3, Multilingual)
Example: alias "UN" → "United Nations", so the model speaks the full name
Case-sensitive: exact matching
Context-aware: understands usage

Speed & Pacing

Speed Range (0.7 - 1.2)
Slow (0.7): deliberate narration
Normal (1.0): default pace
Fast (1.2): quick delivery
Use narrative style in text for natural pacing instead of extreme speeds
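
The sheet gives the range but not where the value goes. A hedged sketch, assuming speed is accepted as an extra voice_settings field with the 0.7-1.2 range above (verify the field name against the current API reference):

payload = {
    "text": "Read this a little more slowly.",
    "model_id": "eleven_multilingual_v2",
    "voice_settings": {
        "stability": 0.6,
        "similarity_boost": 0.75,
        "speed": 0.9,  # assumed field name; slightly slower than the default 1.0
    },
}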

Pronunciation Dictionaries (.PLS)

What They Are
  • Upload custom pronunciation files
  • Reusable across multiple requests
  • Combine phonemes + aliases
  • Case-sensitive matching
Useful Tools
Sequitur G2P
Grapheme-to-phoneme conversion
Phonetisaurus
Phoneme sequence modeling
eSpeak
Text-to-phoneme synthesis
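
Dictionaries are uploaded once and then referenced from later TTS requests. The sketch below is assumption-heavy: the add-from-file endpoint path and the name/file field names come from the public API docs rather than this sheet, so confirm them before use.

import requests

with open("acronyms.pls", "rb") as f:
    response = requests.post(
        "https://api.elevenlabs.io/v1/pronunciation-dictionaries/add-from-file",
        headers={"xi-api-key": "YOUR_API_KEY"},
        data={"name": "acronyms"},                          # dictionary display name
        files={"file": ("acronyms.pls", f, "application/xml")},
    )
response.raise_for_status()
print(response.json())  # returns an id/version to reference in later TTS requests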

Emotion & Expression

Make AI voices sound human and expressive

Narrative Context

Add emotional context around dialogue for natural expression

Sadness Example
"You're leaving?" she asked, her voice trembling with sadness.
Triumph Example
"That's it!" he exclaimed triumphantly.
Works with all models - most reliable method for emotion

v3 Audio Tags

Special tags for Eleven v3 model only

Voice Tags
[laughs] [whispers] [sighs] [exhales]
[sarcastic] [curious] [excited] [crying]
Sound Effects
[gunshot] [applause] [clapping]
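
The tags are written inline in the text itself; nothing else in the request changes. A minimal request-body sketch for the same TTS endpoint used elsewhere in this sheet:

payload = {
    "text": "[whispers] Did you hear that? [sighs] I suppose we should check. [laughs]",
    "model_id": "eleven_v3",  # audio tags only work with Eleven v3
    "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
}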

Special Tags & Effects

Accent Control
[strong French accent] Bonjour!
Replace "French" with any accent
Other Special Tags
[sings]: Singing voice
[woo]: Exclamation

Punctuation Techniques

CAPITALS for Emphasis
This is AMAZING!
Adds stress and volume
Ellipses for Weight
I don't know... maybe...
Creates hesitation and pauses
Question Marks
You did what?
Natural upward inflection

Eleven v3 Mode Settings

Choose the right mode for your emotional needs

Creative
• Most emotional
• Prone to hallucinations
• Best for dramatic content
Use for: Character acting, storytelling
Natural
• Balanced approach
• Closest to original voice
• Good responsiveness
Use for: General purpose, audiobooks
Robust
• Very stable output
• Less responsive to prompts
• Consistent quality
Use for: Technical content, narration

Voice Selection & Settings

Choose and configure the perfect voice

Voice Library

Pre-built Collections
22+ excellent v3-compatible voices available
Emotionally diverse range
Professional quality
Ready to use
Popular Voices
Rachel - Neutral female
Adam - Deep male
Bella - Soft female

Voice Cloning Types

IVC (Instant Voice Cloning)
Upload 1-2 minutes of audio, instant results
Currently better for v3
PVC (Professional Voice Cloning)
3+ hours of audio, highest quality
Studio-grade quality
Longer processing time

Voice Settings Presets

Narration
Stability: 0.6-0.8
Similarity: 0.7
Style: 0.2-0.4
Audiobooks, documentation, training content
Customer Support
Stability: 0.7-0.9
Similarity: 0.9
Style: 0.0-0.2
Clear, consistent voice for customer communication
Characters
Stability: 0.3-0.5
Similarity: 0.8
Style: 0.7-1.0
Dramatic, highly expressive character voices
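
These presets translate directly into voice_settings objects. One way to keep them reusable in code; the numbers are midpoints of the ranges above and the preset names are my own:

VOICE_PRESETS = {
    "narration":        {"stability": 0.7, "similarity_boost": 0.7, "style": 0.3},
    "customer_support": {"stability": 0.8, "similarity_boost": 0.9, "style": 0.1},
    "characters":       {"stability": 0.4, "similarity_boost": 0.8, "style": 0.85},
}

payload = {
    "text": "Chapter one.",
    "model_id": "eleven_multilingual_v2",
    "voice_settings": VOICE_PRESETS["narration"],
}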

Speaker Boost

What It Does
Enhances voice clarity and presence, reduces artifacts
When to Use
Professional recordings
Audiobooks & podcasts
Not for real-time (adds latency)

Selection Tips

Test Multiple Voices
Same text sounds different across voices
Match Content Type
Formal content needs neutral voices
Consider Audience
Age, region, and preferences matter
Regenerate Feature
Use regenerate (3x) for best take

WebSocket Streaming

Real-time audio generation with ultra-low latency

Connection Setup

WebSocket Endpoint
wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream
Required Headers
xi-api-key: YOUR_API_KEY
Use Flash v2.5 for best streaming performance (75ms latency)

Message Protocol

Send Configuration (JSON)
{ "text": "Hello world", "model_id": "eleven_flash_v2_5", "voice_settings": { "stability": 0.5, "similarity_boost": 0.75 } }
Receive Audio Chunks
Binary data (MP3) streams back in real-time

Chunk Management

Send Text in Chunks
ws.send(JSON.stringify({ text: "First chunk. " }))
ws.send(JSON.stringify({ text: "Second chunk." }))
Stream text incrementally for faster response
Optimal Chunk Size
Minimum: 50 chars
Recommended: 200-400 chars
Maximum: 1,000 chars
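
A small helper keeps chunks inside those bounds by splitting at sentence boundaries. This is a generic sketch, not part of any ElevenLabs SDK; the function name and defaults are my own:

import re

def chunk_text(text: str, target: int = 300, maximum: int = 1000) -> list[str]:
    """Split text into sentence-aligned chunks near the recommended 200-400 chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk once adding the next sentence would pass the target size.
        if current and len(current) + len(sentence) + 1 > target:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
        # Hard-split anything longer than the documented 1,000 char maximum.
        while len(current) > maximum:
            chunks.append(current[:maximum])
            current = current[maximum:]
    if current:
        chunks.append(current)
    return chunks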

Flushing & Completion

Flush Message
ws.send(JSON.stringify({ flush: true }))
Sends remaining buffers for clean ending
Completion Signal
Empty binary message indicates end of stream

Buffering Strategy

Initial Buffer
Wait ~250ms before playback to avoid stutter
Rolling Buffer
Always keep 150-200ms of audio ahead
Backpressure
Pause sending text if playback lags

Error Handling

Common Errors
  • 401 - Invalid API key
  • 429 - Rate limited
  • 400 - Invalid chunk format
Retries
Exponential backoff (250ms → 4s)
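
A generic retry helper matching that schedule (the helper name and jitter are my own, not part of the API):

import random
import time

def with_backoff(request_fn, max_attempts: int = 5):
    """Retry a callable with exponential backoff from 250ms up to 4s, plus jitter."""
    delay = 0.25
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(delay + random.uniform(0, 0.1))
            delay = min(delay * 2, 4.0)  # 0.25 -> 0.5 -> 1 -> 2 -> 4 seconds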

Performance Tips

Minimize Latency
  • Use Flash v2.5 model
  • Send 200-500 char chunks
  • Flush at sentence breaks
  • Keep connection alive
Optimize Quality
  • Buffer 300-500ms initially
  • Use complete sentences
  • Avoid mid-word chunks
  • Proper punctuation
Handle Errors
  • Implement reconnection
  • Monitor connection health
  • Handle rate limits
  • Log failed chunks
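
Putting the pieces above together, here is a minimal end-to-end sketch in Python using the third-party websocket-client package. It follows the protocol exactly as described in this section (config message, incremental text chunks, flush, empty completion frame); the endpoint, header, and message fields are taken from above, so double-check them against the current API reference before relying on it.

import json
import websocket  # pip install websocket-client

API_KEY = "YOUR_API_KEY"
VOICE_ID = "21m00Tcm4TlvDq8ikWAM"
URL = f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream"

ws = websocket.create_connection(URL, header={"xi-api-key": API_KEY})

# 1. Send the configuration message (model + voice settings) with the first text.
ws.send(json.dumps({
    "text": "Hello from the streaming API. ",
    "model_id": "eleven_flash_v2_5",
    "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
}))

# 2. Stream further text in sentence-sized chunks, then flush the buffer.
ws.send(json.dumps({"text": "This sentence arrives as a second chunk. "}))
ws.send(json.dumps({"flush": True}))

# 3. Collect binary MP3 frames until the empty completion message arrives.
with open("stream.mp3", "wb") as out:
    while True:
        opcode, frame = ws.recv_data()
        if not frame:
            break  # empty message signals end of stream
        if opcode == websocket.ABNF.OPCODE_BINARY:
            out.write(frame)

ws.close()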

API Quick Reference

Essential endpoints and code examples

Text-to-Speech (POST)

Endpoint
POST https://api.elevenlabs.io/v1/text-to-speech/{voice_id}
Required Headers
xi-api-key: YOUR_API_KEY
Content-Type: application/json
Request Body
{ "text": "Your text here", "model_id": "eleven_turbo_v2_5", "voice_settings": { "stability": 0.5, "similarity_boost": 0.75, "style": 0.0, "use_speaker_boost": true } }

Python Example

import requests

API_KEY = "YOUR_API_KEY"
voice_id = "21m00Tcm4TlvDq8ikWAM"  # Rachel (default female voice)

url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

headers = {
    "xi-api-key": API_KEY,
    "Content-Type": "application/json"
}

data = {
    "text": "Hello world",
    "model_id": "eleven_turbo_v2_5"
}

response = requests.post(url, json=data, headers=headers)
response.raise_for_status()  # fail fast on 401/429/etc.

with open("output.mp3", "wb") as f:
    f.write(response.content)

JavaScript Example

const API_KEY = 'YOUR_API_KEY';
const voiceId = '21m00Tcm4TlvDq8ikWAM'; // Rachel (default female voice)
const url = `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`;

const response = await fetch(url, {
  method: 'POST',
  headers: {
    'xi-api-key': API_KEY,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ text: 'Hello world', model_id: 'eleven_turbo_v2_5' })
});

const audioBuffer = await response.arrayBuffer();
const blob = new Blob([audioBuffer], { type: 'audio/mpeg' });
const urlObject = URL.createObjectURL(blob); // e.g. assign to an <audio> element's src

List Voices (GET)

GET https://api.elevenlabs.io/v1/voices
Returns all available voices with metadata
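
A quick Python sketch to pull the voice list and print the IDs needed by the TTS endpoint above (the voices/voice_id/name fields are what the public API returns; verify against the live response):

import requests

response = requests.get(
    "https://api.elevenlabs.io/v1/voices",
    headers={"xi-api-key": "YOUR_API_KEY"},
)
response.raise_for_status()

for voice in response.json()["voices"]:
    print(voice["voice_id"], voice["name"])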

List Models (GET)

GET https://api.elevenlabs.io/v1/models
See available model IDs and capabilities

Voice Cloning (POST)

POST https://api.elevenlabs.io/v1/voices/add
  • Upload audio files
  • Provide descriptive name
  • Choose cloning type (IVC/PVC)
  • Optional: description, labels
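
A hedged Python sketch for an IVC upload against that endpoint. The multipart field names (name, files, description) mirror the bullet points above but are not spelled out in this sheet, so treat them as assumptions and confirm in the API reference:

import requests

files = [
    ("files", ("sample1.mp3", open("sample1.mp3", "rb"), "audio/mpeg")),
    ("files", ("sample2.mp3", open("sample2.mp3", "rb"), "audio/mpeg")),
]

response = requests.post(
    "https://api.elevenlabs.io/v1/voices/add",
    headers={"xi-api-key": "YOUR_API_KEY"},
    data={"name": "My cloned voice", "description": "IVC test clone"},
    files=files,
)
response.raise_for_status()
print(response.json())  # contains the new voice_id on success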

Language Support

Multilingual voice generation across 70+ languages

Language Coverage by Model

Flash v2.5 & Turbo v2.5
32
Major global languages
Multilingual v2
29
Premium quality support
Eleven v3
70+
Most comprehensive

Language Codes

{ "text": "Hola, ¿cómo estás?", "model_id": "eleven_multilingual_v2", "language_code": "es" }
Always set language_code for best accuracy

Multilingual Tips

Consistency
Keep punctuation style consistent with the language
Context
Provide regional context (e.g., Mexico vs Spain)
Accents
Use accent tags for better authenticity

Regional Variants

  • Spanish: es-ES, es-MX, es-AR
  • Portuguese: pt-PT, pt-BR
  • English: en-US, en-GB, en-AU
  • French: fr-FR, fr-CA

Code Switching

"Welcome to Mexico City. Bienvenidos a la Ciudad de México."
Use full sentences per language for best effect

Asian Language Notes

Chinese
  • Simplified (zh-CN)
  • Traditional (zh-TW)
  • Tone-aware generation
  • Character recognition
Japanese
  • Hiragana support
  • Katakana support
  • Kanji recognition
  • Pitch accent aware
Korean
  • Hangul support
  • Hanja recognition
  • Formal/informal tones
  • Natural intonation

Special Characters

Accented Characters
á é í ó ú ñ ç ü ö ä
Cyrillic Script
Russian, Ukrainian, Bulgarian
Right-to-Left
Arabic, Hebrew support
Diacritics
Proper pronunciation handling

Automatic Detection

How It Works
Models can auto-detect language without code
Works for major languages
Less accurate than explicit code
Always prefer language_code

Troubleshooting

Common issues and solutions

Audio Quality Issues

Robotic/Unnatural Sound
  • Lower stability (try 0.3-0.5)
  • Increase similarity boost
  • Add narrative context
  • Try different voice
Inconsistent Output
  • Increase stability (0.7-0.8)
  • Use Turbo v2 instead of v3
  • Remove excessive breaks
  • Simplify text formatting
Artifacts/Glitches
  • Enable speaker boost
  • Reduce style exaggeration
  • Fix text punctuation
  • Regenerate (use 3x feature)

Voice Cloning Problems

Poor Voice Match
  • Use 2+ minutes of audio
  • Ensure clean recording
  • Single speaker only
  • No background noise
Clone Sounds Different
  • Match audio style to use case
  • Adjust similarity boost
  • Try PVC instead of IVC
  • Use v3 model for IVC
Upload Fails
  • Check file format (MP3/WAV)
  • Max 10 files per upload
  • 1-2 min total duration
  • Remove silence/pauses

Pronunciation Issues

Mispronounced Names
  • Use phoneme tags
  • Add pronunciation dictionary
  • Provide phonetic hints
  • Spell out syllables
Foreign Words
  • Set language_code
  • Use narrative context
  • Try accent tags
  • Break into syllables

Latency Issues

Slow Responses
  • Use Flash v2.5
  • Shorten text chunks
  • Avoid complex tags
  • Use WebSocket streaming
Queue Delays
  • Retry after delay
  • Check status page
  • Use alternate region
  • Batch non-urgent jobs

Emotion Not Working

Flat Delivery
  • Add contextual cues
  • Use narrative descriptions
  • Lower stability
  • Use v3 tags
Overacting
  • Increase stability
  • Reduce style exaggeration
  • Use calmer language
  • Switch to Turbo

Character Limit Issues

Text Too Long
  • Flash/Turbo: 40k limit
  • Multilingual: 5k limit
  • v3: 3k limit
  • Split by sentences
Streaming Overflows
  • Send 200-500 char chunks
  • Flush after sections
  • Use queue for long form
  • Store progress markers

Download Cheat Sheet

Get the complete visual reference guide