ElevenLabs Cheat Sheet

Current quick reference for the ElevenLabs platform in 2026

This cheat sheet covers the current ElevenLabs stack: text-to-speech (TTS), model selection, voice library and cloning, API usage, streaming, language support, and where agents and speech-to-text fit into the platform.

Reading time: ≈ 7 minutes
Last updated: March 30, 2026

Current ElevenLabs Platform

ElevenLabs is no longer just a text-to-speech tool. The platform now spans TTS, speech-to-text, voices, agents, and other audio workflows.

Text to Speech

Core TTS models include eleven_flash_v2_5, eleven_multilingual_v2, and eleven_v3.

Speech to Text

Scribe v2 and realtime STT support transcription, timestamps, captions, and meeting-style audio workflows.

Voices

Use Voice Library, Instant Voice Cloning, Professional Voice Cloning, and Voice Design depending on speed, quality, and sharing needs.

Agents

ElevenLabs now supports agent-style, low-latency voice experiences where TTS and realtime audio matter more than one-off generation.

Music & Sound

The platform also covers music and sound effects, so ElevenLabs now sits closer to a broader audio generation stack than a TTS-only API.

Dubbing & Cleanup

Dubbing, voice isolation, and related audio cleanup workflows are now part of the product surface many users expect from ElevenLabs.

What changed since older cheat sheets?

  • Older v1-era references are outdated.
  • Turbo-first recommendations are no longer the best default framing.
  • The product is broader now: TTS, STT, voices, agents, and audio workflows.

Quick Start Guide

Get results in 5 minutes

Step 1: Complete Working Example

Copy this cURL command, replace YOUR_API_KEY with your own key, and run it to generate your first audio file:

curl -X POST https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM \
  -H "xi-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Welcome to ElevenLabs. This is your first generated voice.",
    "model_id": "eleven_flash_v2_5",
    "voice_settings": {
      "stability": 0.5,
      "similarity_boost": 0.75
    }
  }' \
  --output speech.mp3
Voice ID used: 21m00Tcm4TlvDq8ikWAM (Rachel, the default female voice)
Get your API key: elevenlabs.io → Developer → API Key
Step 2: Choose Your Model

  • Need emotion & drama? → eleven_v3 (dialogue, characters, storytelling)
  • Need speed (real-time)? → eleven_flash_v2_5 (chatbots, live agents, ~75ms latency)
  • Need quality & languages? → eleven_multilingual_v2 (29 languages, premium quality)
  • Need a safe default? → eleven_flash_v2_5 for speed, eleven_multilingual_v2 for polished narration
Step 3: Tune Voice Settings

Stability (0.0 - 1.0)
  • Low (0.3): varied & expressive
  • High (0.8): consistent & predictable
Similarity Boost (0.0 - 1.0)
  • Higher = closer to the original voice
  • Sweet spot: 0.75 for most cases
Style Exaggeration (0.0 - 1.0)
  • Amplifies the voice's natural style
  • Start at 0.0, increase only if needed
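These three sliders map directly onto the `voice_settings` object in API requests. A minimal Python sketch of assembling a request body; the clamping helper and defaults are illustrative, not part of any official SDK:

```python
def clamp(value, lo=0.0, hi=1.0):
    """Keep a setting inside the valid 0.0 - 1.0 range."""
    return max(lo, min(hi, value))

def build_tts_payload(text, model_id="eleven_flash_v2_5",
                      stability=0.5, similarity_boost=0.75, style=0.0):
    """Assemble a TTS request body with sanity-checked voice settings."""
    return {
        "text": text,
        "model_id": model_id,
        "voice_settings": {
            "stability": clamp(stability),
            "similarity_boost": clamp(similarity_boost),
            "style": clamp(style),
        },
    }

payload = build_tts_payload("Hello world", stability=1.4)
# out-of-range stability is clamped back to 1.0
```

POST this dict as JSON to the text-to-speech endpoint shown in the API Quick Reference section.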

Models Comparison

Choose the right model for your use case

Flash v2.5 (eleven_flash_v2_5)
  Best for: Real-time applications — chatbots, live agents, conversational AI
  Latency: ~75ms (ultra-fast)
  Languages: 32
  Char limit: 40,000 (~40 min audio)
  Key features: lowest latency • optimized for speed • WebSocket streaming • good quality/speed balance

Multilingual v2 (eleven_multilingual_v2)
  Best for: Premium quality — audiobooks, podcasts, videos
  Latency: ~1-2s (slower)
  Languages: 29 (premium quality)
  Char limit: 10,000 (longer-form narration)
  Key features: highest quality • best for long-form content • natural prosody • accent preservation

Eleven v3 (eleven_v3)
  Best for: Emotional & expressive — storytelling, character voices, drama
  Latency: ~1-2s (slower)
  Languages: 70+ (most supported)
  Char limit: 5,000 (designed for expressive output)
  Key features: multi-speaker dialogue • audio tags [laughs] [whispers] • most emotional depth • character acting

Scribe v2 (scribe_v2)
  Best for: Speech-to-text — transcription, timestamps, diarization
  Latency: ~150ms (realtime variant available)
  Languages: 90+ (most coverage)
  Input: audio files (no character limit)
  Key features: converts speech to text • word-level timestamps • speaker diarization • high accuracy
  • Need speed? Real-time, conversational AI, low latency → Flash v2.5
  • Need quality? Audiobooks, premium multilingual content, natural prosody → Multilingual v2
  • Need emotion? Drama, character voices, expressive storytelling → Eleven v3
  • Need STT? Transcription, realtime captions, meeting audio → Scribe v2 / Realtime
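This decision guide fits in a small lookup table. A sketch; the need labels are arbitrary, while the model IDs are the ones listed in this sheet:

```python
MODEL_FOR_NEED = {
    "speed": "eleven_flash_v2_5",         # real-time, lowest latency
    "quality": "eleven_multilingual_v2",  # polished multilingual narration
    "emotion": "eleven_v3",               # expressive dialogue and drama
    "stt": "scribe_v2",                   # transcription, not TTS
}

def pick_model(need):
    """Return a model ID for a need; fall back to the safe default."""
    return MODEL_FOR_NEED.get(need, "eleven_flash_v2_5")
```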

Text Control & Formatting

Control timing, pronunciation, and speech patterns

Pauses & Timing

Break Tags (up to 3 seconds)
<break time="1.5s" />
Insert natural pauses between sentences
Combine with Sentence Endings
Sentence. <break time="2s" /> Next.
Place break tags after sentence-ending punctuation for cleaner transitions.
Warning: too many breaks can cause output instability.

Pronunciation Control

CMU Arpabet (Recommended)
<phoneme alphabet="cmu-arpabet" ph="M AE1 D IH0 S AH0 N">Madison</phoneme>
IPA (International)
<phoneme alphabet="ipa" ph="ˈæktʃuəli">actually</phoneme>
Compatible: Flash v2, Turbo v2, English v1

Alias Tags

For Other Models (v3, Multilingual)
<lexeme> <grapheme>UN</grapheme> <alias>United Nations</alias> </lexeme>
  • Case-sensitive: exact matching
  • Context-aware: understands usage
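Both tag styles are plain strings, so small helpers can keep the markup consistent across a script. A sketch; the helper names are made up for illustration:

```python
def phoneme_tag(word, ph, alphabet="cmu-arpabet"):
    """Wrap a word in a <phoneme> tag (Flash v2, Turbo v2, English v1)."""
    return f'<phoneme alphabet="{alphabet}" ph="{ph}">{word}</phoneme>'

def alias_tag(grapheme, alias):
    """Wrap an abbreviation in a <lexeme> alias (v3, Multilingual)."""
    return (f"<lexeme> <grapheme>{grapheme}</grapheme> "
            f"<alias>{alias}</alias> </lexeme>")

# Build text with a corrected pronunciation inline:
text = f"{phoneme_tag('Madison', 'M AE1 D IH0 S AH0 N')} works at the UN."
```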

Speed & Pacing

Speed Range: 0.7 - 1.2
  • Slow (0.7): deliberate narration
  • Normal (1.0): default pace
  • Fast (1.2): quick delivery
Prefer narrative style in the text for natural pacing instead of extreme speed values.

Pronunciation Dictionaries (.PLS)

What They Are
  • Upload custom pronunciation files
  • Reusable across multiple requests
  • Combine phonemes + aliases
  • Case-sensitive matching
Useful Tools
Sequitur G2P
Grapheme-to-phoneme conversion
Phonetisaurus
Phoneme sequence modeling
eSpeak
Text-to-phoneme synthesis

Emotion & Expression

Make AI voices sound more human and expressive without overfitting to fragile prompt tricks

Narrative Context

Add emotional context around dialogue for natural expression

Sadness Example
"You're leaving?" she asked, her voice trembling with sadness.
Triumph Example
"That's it!" he exclaimed triumphantly.
Works with all models - most reliable method for emotion

v3 Audio Tags

Special tags for Eleven v3 model only

Voice Tags
[laughs] [whispers] [sighs] [exhales]
[sarcastic] [curious] [excited] [crying]
Expressive Tags
[applause] [clapping] [gasps]

Special Tags & Effects

Accent Control
[strong French accent] Bonjour!
Useful for experimentation, but validate results per voice and model.
Other Special Tags
[sings] - Singing voice

Punctuation Techniques

CAPITALS for Emphasis
This is AMAZING!
Adds stress and volume
Ellipses for Weight
I don't know... maybe...
Creates hesitation and pauses
Question Marks
You did what?
Natural upward inflection

Eleven v3 Mode Settings

Choose the right mode for your emotional needs

Creative
• Most emotional
• Prone to hallucinations
• Best for dramatic content
Use for: Character acting, storytelling
Natural
• Balanced approach
• Closest to original voice
• Good responsiveness
Use for: General purpose, audiobooks
Robust
• Very stable output
• Less responsive to prompts
• Consistent quality
Use for: Technical content, narration

Voice Selection & Settings

Choose the right voice source, clone type, and settings for your use case

Current ElevenLabs Voice Stack

Voice Library
Browse marketplace and prebuilt voices for fast testing.
IVC
Instant Voice Cloning for fast turnaround from a small sample.
PVC
Professional Voice Cloning for the highest quality and shareable clones.
Voice Design
Create a new synthetic voice instead of cloning an existing speaker.

Voice Library

Pre-built Collections
Browse prebuilt and community voices in the Voice Library marketplace
Emotionally diverse range
Professional quality
Ready to use
Popular Voices
Rachel - Neutral female
Adam - Deep male
Bella - Soft female
Tip: use the library for rapid testing before investing in a custom clone.

Voice Cloning Types

IVC (Instant Voice Cloning)
Upload 1-2 minutes of audio, instant results
Best for quick prototyping and internal testing
PVC (Professional Voice Cloning)
3+ hours of audio, highest quality
Studio-grade quality
Longer processing time
Shareability
Professional Voice Clones can be shared in the Voice Library; treat IVC as a faster, lighter path.

Voice Settings Presets

Narration
  Stability: 0.6-0.8 | Similarity: 0.7 | Style: 0.2-0.4
  Audiobooks, documentation, training content
Customer Support
  Stability: 0.7-0.9 | Similarity: 0.9 | Style: 0.0-0.2
  Clear, consistent voice for customer communication
Characters
  Stability: 0.3-0.5 | Similarity: 0.8 | Style: 0.7-1.0
  Dramatic, highly expressive character voices
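These presets translate directly into reusable `voice_settings` dicts. A sketch using the midpoints of each suggested range; tune exact values per voice rather than treating these as canonical:

```python
PRESETS = {
    "narration":        {"stability": 0.70, "similarity_boost": 0.70, "style": 0.30},
    "customer_support": {"stability": 0.80, "similarity_boost": 0.90, "style": 0.10},
    "characters":       {"stability": 0.40, "similarity_boost": 0.80, "style": 0.85},
}

def settings_for(use_case):
    """Return a copy of a preset so callers can tweak it safely."""
    return dict(PRESETS[use_case])
```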

Speaker Boost

What It Does
Enhances voice clarity and presence, reduces artifacts
When to Use
Professional recordings
Audiobooks & podcasts
Not for real-time (adds latency)

Selection Tips

Test Multiple Voices
Same text sounds different across voices
Match Content Type
Formal content needs neutral voices
Consider Audience
Age, region, and preferences matter
Regenerate Feature
Generate a few takes before locking a voice for production

WebSocket Streaming

Real-time audio generation patterns for low-latency ElevenLabs workflows

Connection Setup

WebSocket Endpoint
wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input
Required Headers
xi-api-key: YOUR_API_KEY
Use Flash v2.5 for best streaming performance. Confirm endpoint parameters against current docs before production rollout.

Message Protocol

Send Configuration (JSON)
{
  "text": "Hello world",
  "model_id": "eleven_flash_v2_5",
  "voice_settings": { "stability": 0.5, "similarity_boost": 0.75 }
}
Receive Audio Chunks
Audio chunks stream back incrementally; exact framing and formats can vary by integration.

Chunk Management

Send Text in Chunks
ws.send(JSON.stringify({ text: "First chunk. " }))
ws.send(JSON.stringify({ text: "Second chunk." }))
Stream text incrementally for faster response, ideally at phrase or sentence boundaries.
Chunking Guidance
  • Smaller chunks: lower latency
  • Sentence-sized chunks: best balance
  • Very large chunks: higher latency
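A sentence-boundary chunker keeps streamed text inside the sweet spot above. A sketch; the regex only handles simple `.`/`!`/`?` endings, so abbreviations will over-split, and a single sentence longer than the limit is passed through whole:

```python
import re

def sentence_chunks(text, max_len=500):
    """Split text at sentence boundaries into chunks of at most max_len chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_len:
            chunks.append(current)   # flush the full chunk
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk can then be sent as one `ws.send` text message.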

Flushing & Completion

Flush Message
ws.send(JSON.stringify({ flush: true }))
Flush any remaining buffered text for a clean ending.
Completion Signal
Watch for the service completion event defined in the current WebSocket docs.

Buffering Strategy

Initial Buffer
Wait ~250ms before playback to avoid stutter
Rolling Buffer
Always keep 150-200ms of audio ahead
Backpressure
Pause sending text if playback lags

Error Handling

Common Errors
  • 401 - Invalid API key
  • 429 - Rate limited
  • 400 - Invalid chunk format
Retries
Exponential backoff (250ms → 4s)
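The retry schedule above (250 ms, doubling up to a 4 s cap) is easy to precompute. A sketch; the helper names are illustrative:

```python
import time

def backoff_delays(base=0.25, cap=4.0, attempts=6):
    """Exponential backoff schedule: 0.25s, 0.5s, 1s, 2s, 4s, 4s, ..."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]

def with_retries(fn, attempts=6):
    """Call fn, sleeping per the schedule after each failure."""
    last_error = None
    for delay in backoff_delays(attempts=attempts):
        try:
            return fn()
        except Exception as err:  # in practice, retry only on 429/5xx
            last_error = err
            time.sleep(delay)
    raise last_error
```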

Performance Tips

Minimize Latency
  • Use Flash v2.5 model
  • Send 200-500 char chunks
  • Flush at sentence breaks
  • Keep connection alive
Optimize Quality
  • Buffer 300-500ms initially
  • Use complete sentences
  • Avoid mid-word chunks
  • Proper punctuation
Handle Errors
  • Implement reconnection
  • Monitor connection health
  • Handle rate limits
  • Log failed chunks

API Quick Reference

Essential endpoints, model IDs, and current usage patterns

Text-to-Speech (POST)

Endpoint
POST https://api.elevenlabs.io/v1/text-to-speech/{voice_id}
Required Headers
xi-api-key: YOUR_API_KEY
Content-Type: application/json
Request Body
{
  "text": "Your text here",
  "model_id": "eleven_flash_v2_5",
  "voice_settings": {
    "stability": 0.5,
    "similarity_boost": 0.75,
    "style": 0.0,
    "use_speaker_boost": true
  }
}

Python Example

import requests

API_KEY = "YOUR_API_KEY"
voice_id = "21m00Tcm4TlvDq8ikWAM"  # Rachel (default female voice)

url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

headers = {
    "xi-api-key": API_KEY,
    "Content-Type": "application/json"
}

data = {
    "text": "Hello world",
    "model_id": "eleven_flash_v2_5"
}

response = requests.post(url, json=data, headers=headers)
response.raise_for_status()  # fail fast on auth, quota, or validation errors

with open("output.mp3", "wb") as f:
    f.write(response.content)

JavaScript Example

const url = 'https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM';
const API_KEY = 'YOUR_API_KEY';

const response = await fetch(url, {
  method: 'POST',
  headers: {
    'xi-api-key': API_KEY,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ text: 'Hello world', model_id: 'eleven_flash_v2_5' })
});

const audioBuffer = await response.arrayBuffer();
const blob = new Blob([audioBuffer], { type: 'audio/mpeg' });
const urlObject = URL.createObjectURL(blob);

List Voices (GET)

GET https://api.elevenlabs.io/v1/voices
Returns all available voices with metadata

List Models (GET)

GET https://api.elevenlabs.io/v1/models
See available model IDs and capabilities

Speech-to-Text (POST)

POST https://api.elevenlabs.io/v1/speech-to-text
Use Scribe v2 for transcription, timestamps, and diarization workflows.
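A minimal request sketch for this endpoint, assuming it accepts multipart form data with `model_id` and `file` fields; confirm the field names against the current speech-to-text reference before relying on them:

```python
STT_URL = "https://api.elevenlabs.io/v1/speech-to-text"

def build_stt_request(api_key, model_id="scribe_v2"):
    """Headers and form fields for a transcription request."""
    headers = {"xi-api-key": api_key}
    data = {"model_id": model_id}
    return headers, data

# Usage with the requests library (needs a real key and audio file):
# import requests
# headers, data = build_stt_request("YOUR_API_KEY")
# with open("meeting.wav", "rb") as f:
#     resp = requests.post(STT_URL, headers=headers, data=data,
#                          files={"file": f})
# print(resp.json())
```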

Voice Cloning (POST)

POST https://api.elevenlabs.io/v1/voices/add
  • Upload audio files
  • Provide descriptive name
  • Choose cloning type (IVC/PVC)
  • Optional: description, labels

Current Model IDs to Know

Fast TTS
eleven_flash_v2_5
Premium TTS
eleven_multilingual_v2
Expressive TTS
eleven_v3

Language Support

Current language coverage depends on model, with Eleven v3 offering the broadest TTS support

Language Coverage by Model

Flash v2.5
32
Major global languages
Multilingual v2
29
Premium quality support
Eleven v3
70+
Most comprehensive

Language Codes

{
  "text": "Hola, ¿cómo estás?",
  "model_id": "eleven_multilingual_v2",
  "language_code": "es"
}
Set language_code when the endpoint and model support it for more predictable output.

Multilingual Tips

Consistency
Keep punctuation style consistent with the language
Context
Provide regional context (e.g., Mexico vs Spain)
Accents
Prefer voice choice and clean localized text over heavy prompt tricks

Regional Variants

  • Spanish: es-ES, es-MX, es-AR
  • Portuguese: pt-PT, pt-BR
  • English: en-US, en-GB, en-AU
  • French: fr-FR, fr-CA

Code Switching

"Welcome to Mexico City. Bienvenidos a la Ciudad de México."
Use full sentences per language for best effect

Model-Specific Notes

Flash v2.5
  • Best for realtime TTS
  • 32 languages
  • Strong default for agents
  • Use for low-latency playback
Multilingual v2
  • Best for polished multilingual narration
  • 29 languages
  • Better long-form balance
  • Use when quality matters more than speed
Eleven v3
  • Broadest TTS language coverage
  • 70+ languages
  • Best expressive model
  • Use for dialogue and emotional delivery

Special Characters

Accented Characters
á é í ó ú ñ ç ü ö ä
Cyrillic Script
Russian, Ukrainian, Bulgarian
Right-to-Left
Arabic, Hebrew support
Diacritics
Proper pronunciation handling

Automatic Detection

How It Works
Some workflows can infer language automatically, but explicit configuration is safer.
  • Works best for common, clean inputs
  • Less accurate than an explicit code
  • Prefer setting language_code when available

Troubleshooting

Common issues and solutions

Audio Quality Issues

Robotic/Unnatural Sound
  • Lower stability (try 0.3-0.5)
  • Increase similarity boost
  • Add narrative context
  • Try different voice
Inconsistent Output
  • Increase stability (0.7-0.8)
  • Switch from v3 to Flash or Multilingual for more stable output
  • Remove excessive breaks
  • Simplify text formatting
Artifacts/Glitches
  • Enable speaker boost
  • Reduce style exaggeration
  • Fix text punctuation
  • Regenerate (use 3x feature)

Voice Cloning Problems

Poor Voice Match
  • Use 2+ minutes of audio
  • Ensure clean recording
  • Single speaker only
  • No background noise
Clone Sounds Different
  • Match audio style to use case
  • Adjust similarity boost
  • Try PVC instead of IVC
  • Test clone quality across current TTS models
Upload Fails
  • Check file format (MP3/WAV)
  • Max 10 files per upload
  • 1-2 min total duration
  • Remove silence/pauses

Pronunciation Issues

Mispronounced Names
  • Use phoneme tags
  • Add pronunciation dictionary
  • Provide phonetic hints
  • Spell out syllables
Foreign Words
  • Set language_code
  • Use narrative context
  • Try accent tags
  • Break into syllables

Latency Issues

Slow Responses
  • Use Flash v2.5
  • Shorten text chunks
  • Avoid complex tags
  • Use WebSocket streaming
Queue Delays
  • Retry after delay
  • Check status page
  • Check plan limits and current service guidance
  • Batch non-urgent jobs

Emotion Not Working

Flat Delivery
  • Add contextual cues
  • Use narrative descriptions
  • Lower stability
  • Use v3 tags
Overacting
  • Increase stability
  • Reduce style exaggeration
  • Use calmer language
  • Switch to Flash or Multilingual for more controlled output

Character Limit Issues

Text Too Long
  • Flash v2.5: 40k limit
  • Multilingual v2: 10k limit
  • Eleven v3: 5k limit
  • Split by sentences
Streaming Overflows
  • Send 200-500 char chunks
  • Flush after sections
  • Use queue for long form
  • Store progress markers
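The per-model limits above can be enforced before sending. A sketch that splits at sentence boundaries; a single sentence longer than the limit is passed through unsplit, and the limits are the ones listed in this sheet:

```python
import re

CHAR_LIMITS = {
    "eleven_flash_v2_5": 40_000,
    "eleven_multilingual_v2": 10_000,
    "eleven_v3": 5_000,
}

def split_for_model(text, model_id):
    """Split text into pieces under the model's character limit."""
    limit = CHAR_LIMITS[model_id]
    pieces, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if current and len(current) + len(sentence) + 1 > limit:
            pieces.append(current)   # flush a piece at the limit
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        pieces.append(current)
    return pieces
```

Send each piece as its own request, then concatenate the resulting audio files in order.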

ElevenLabs FAQ

Fast answers to common ElevenLabs questions people search before using the API or picking a model.

What is the best ElevenLabs model right now?

It depends on the job: use Flash v2.5 for low latency, Multilingual v2 for polished multilingual narration, and Eleven v3 for expressive dialogue or character-style output.

Is ElevenLabs only for text to speech?

No. ElevenLabs now spans text-to-speech, speech-to-text, voice cloning, voice design, streaming, agents, dubbing, and other audio workflows.

Which ElevenLabs model should I use for real-time apps?

Start with eleven_flash_v2_5 for low-latency generation, especially when building chat, assistant, or agent-style voice interfaces.

What is the difference between IVC and PVC?

Instant Voice Cloning is faster and lighter for quick tests, while Professional Voice Cloning is the higher-quality path for serious production workflows and shareable cloned voices.

Does ElevenLabs support multiple languages?

Yes. Coverage varies by model, with Eleven v3 offering the broadest TTS language support, while Multilingual v2 remains a strong choice for high-quality multilingual narration.