
ElevenLabs charges per character and the multiplier changes by model. This skill teaches your agent the model trade-offs, SDK shape, MCP tools, and legal gotchas.
Install command
npm install @scopeful/elevenlabs-tts-api
Download skill file: elevenlabs-tts-api.md (9 KB)
Fetch via the Scopeful MCP (any client)
Once your agent is connected to the Scopeful MCP, it can load this skill on demand, no install required:
get_skill('elevenlabs-tts-api')
ElevenLabs charges per character, and the multiplier changes by model. Agents that pick the wrong model can spend 2x the credits for output the user could not tell apart in a blind test. Voice cloning has hard legal rules and the free tier has no commercial license. This skill teaches your agent the model trade-offs, the SDK shape, the official MCP tools, and the legal gotchas so the user is not surprised by their next invoice.
Reach for ElevenLabs when the user wants a voiceover, narration, audiobook chapter, or character read; real-time TTS for an agent (Flash v2.5); a read in 1 of 70+ languages (v3) or 29 languages (Multilingual v2); to clone their own voice or one they have written permission to use; or to transcribe audio with word-level timestamps (Scribe).
Do not reach for ElevenLabs when the user wants music with vocals (use Suno), wants to clone a celebrity or third-party voice without consent (legal hard no, see below), or wants lip-sync video (chain to Hedra or HeyGen after generating audio).
pip install elevenlabs # Python SDK
npm install @elevenlabs/elevenlabs-js # JS / Node SDK
Official MCP server (Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode):
{
  "mcpServers": {
    "ElevenLabs": {
      "command": "uvx",
      "args": ["elevenlabs-mcp"],
      "env": { "ELEVENLABS_API_KEY": "<your-api-key>" }
    }
  }
}
The official elevenlabs-mcp server exposes: text_to_speech, speech_to_text, text_to_sound_effects, text_to_voice, speech_to_speech, search_voices, get_voice, list_models, voice_clone, isolate_audio, check_subscription, plus agent-platform tools (create_agent, list_agents, get_agent, add_knowledge_base_to_agent, get_conversation, list_conversations). Default output dir is ~/Desktop, override with ELEVENLABS_MCP_BASE_PATH.
from elevenlabs.client import ElevenLabs
client = ElevenLabs()  # reads ELEVENLABS_API_KEY from env
audio = client.text_to_speech.convert(
    text="The first move is what sets everything in motion.",
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_flash_v2_5",
    output_format="mp3_44100_128",
)
# convert() yields byte chunks; write them to disk yourself.
with open("out.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
const client = new ElevenLabsClient();
const audio = await client.textToSpeech.convert("JBFqnCBsd6RMkjVDRZzb", {
  text: "The first move is what sets everything in motion.",
  modelId: "eleven_flash_v2_5",
  outputFormat: "mp3_44100_128",
});
convert() returns an iterator of bytes (Python) or a ReadableStream<Uint8Array> (JS). It does not write to disk. The agent must collect bytes and write the file.
Pick by latency and credit math, not by name. Flash and Turbo are half-price per character.
| Model ID | Cost / char | Latency | Max chars | Languages | Use case |
|---|---|---|---|---|---|
| eleven_flash_v2_5 | 0.5x | ~75 ms | 40,000 | 32 | Real-time agents, live apps, bulk narration on a budget |
| eleven_turbo_v2_5 | 0.5x | ~250 ms | 40,000 | 32 | Legacy real-time pick, prefer Flash v2.5 |
| eleven_multilingual_v2 | 1x | higher | 10,000 | 29 | Audiobook quality, stable reads, default for production VO |
| eleven_v3 | 1x | standard | 5,000 | 70+ | Character dialogue, dramatic delivery, audio tags like [whispers], broadest language coverage |
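The credit math in the table can be sketched as a small lookup. This is a minimal illustration, not part of the ElevenLabs SDK: the `MODELS` dictionary and both function names are hypothetical, the values are copied from the table, and rounding fractional credits up is an assumption.

```python
import math

# Per-character credit multipliers and per-call character caps, copied from the table.
MODELS = {
    "eleven_flash_v2_5": {"cost_per_char": 0.5, "max_chars": 40_000},
    "eleven_turbo_v2_5": {"cost_per_char": 0.5, "max_chars": 40_000},
    "eleven_multilingual_v2": {"cost_per_char": 1.0, "max_chars": 10_000},
    "eleven_v3": {"cost_per_char": 1.0, "max_chars": 5_000},
}


def estimate_credits(char_count: int, model_id: str) -> int:
    """Estimate credits for one request (rounding up is an assumption)."""
    return math.ceil(char_count * MODELS[model_id]["cost_per_char"])


def fits_in_one_call(char_count: int, model_id: str) -> bool:
    """Check the text against the model's per-call character cap."""
    return char_count <= MODELS[model_id]["max_chars"]
```

For a 1,247-character script, `estimate_credits(1247, "eleven_flash_v2_5")` gives 624, half of what the same text costs on `eleven_multilingual_v2`.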
Decision rule: real-time or tight budget, use eleven_flash_v2_5. Polished narration in 29 main languages, eleven_multilingual_v2. Expressive character work, audio-tagged emotion, or rare languages, eleven_v3. Flash and Turbo do not auto-normalize numbers; if the script has "1999" the model may read "one nine nine nine". Pre-normalize the text, or set apply_text_normalization on Creator+ plans.
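The decision rule can be written as a small helper. This is a sketch under stated assumptions: `pick_model` and its flags are hypothetical names, and giving real-time and budget constraints priority over expressiveness is an assumption the rule above does not spell out.

```python
def pick_model(
    realtime: bool = False,
    tight_budget: bool = False,
    expressive: bool = False,
    rare_language: bool = False,
) -> str:
    """Map the decision rule onto a model ID.

    Assumption: latency and budget constraints win when they conflict
    with a wish for expressive delivery.
    """
    if realtime or tight_budget:
        return "eleven_flash_v2_5"
    if expressive or rare_language:
        return "eleven_v3"
    # Polished narration in the 29 main languages.
    return "eleven_multilingual_v2"
```

An agent could call `pick_model(realtime=True)` for a live voice agent and `pick_model()` for a standard audiobook chapter.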
Default voices come with the account. List them with client.voices.search(). Pin a voice_id once chosen so output is reproducible.
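Two cloning paths: Instant Voice Cloning (IVC) and Professional Voice Cloning (PVC). An IVC clone is created from a few audio samples: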
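# Open each sample in binary mode: the SDK uploads file contents, not path strings.
samples = ["./sample_0.mp3", "./sample_1.mp3", "./sample_2.mp3"]
voice = client.voices.ivc.create(
    name="Narrator Alex",
    files=[open(path, "rb") for path in samples],
)
# Store voice.voice_id and reuse it on every convert() call.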
PVC verification asks the speaker to read a captcha phrase in the same voice as the uploads. ElevenLabs denies activation on mismatch. The agent cannot bypass this.
Use streaming when first-audio-byte latency under 1 second matters (live agent, interactive playback). Skip streaming for batch narration to file.
from elevenlabs import stream
from elevenlabs.client import ElevenLabs
audio_stream = ElevenLabs().text_to_speech.stream(
    text="Streaming response coming through.",
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_flash_v2_5",
)
stream(audio_stream)  # plays locally; or pipe bytes to a websocket
For sub-200ms first-byte latency use the websocket endpoint (text-to-speech/{voice_id}/stream-input) and feed text chunks as the upstream LLM produces them. Multi-context websocket is only needed for concurrent independent streams over one socket.
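The shape of a stream-input session can be sketched without opening a socket. Everything below is an assumption to verify against the ElevenLabs WebSocket reference: the `wss://api.elevenlabs.io/v1` host, the `model_id` query parameter, and the frame convention (a single-space opening frame, one frame per text chunk, an empty-text closing frame). Both function names are hypothetical, and authentication and the network client are left out.

```python
import json
from urllib.parse import quote, urlencode


def stream_input_url(voice_id: str, model_id: str) -> str:
    """Build the websocket URL (host and query parameter are assumptions)."""
    query = urlencode({"model_id": model_id})
    path = f"text-to-speech/{quote(voice_id, safe='')}/stream-input"
    return f"wss://api.elevenlabs.io/v1/{path}?{query}"


def stream_input_messages(chunks):
    """Yield JSON frames for one session, in the order they would be sent."""
    yield json.dumps({"text": " "})  # assumed opening frame
    for chunk in chunks:
        if chunk:  # skip empty chunks so the stream is not closed early
            yield json.dumps({"text": chunk})
    yield json.dumps({"text": ""})  # assumed end-of-input frame
```

In a live agent, `chunks` would be the token stream from the upstream LLM, and each yielded frame would be sent over the socket as soon as it is produced.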
voice_settings accepts: stability (0-1, default 0.5; lower means more emotional range), similarity_boost (0-1, default 0.75; how strictly the model adheres to the cloned voice), style (0-1, default 0; exaggerates characteristics, leave at 0 unless you want stylized reads), use_speaker_boost (bool, default true), speed (default 1.0; safe range 0.7 to 1.2).
For audiobook narration: stability 0.55, similarity_boost 0.85, style 0. For dialog and character reads drop stability to 0.3.
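The defaults, ranges, and presets above can be collected into one helper that returns a plain dictionary. This is a sketch: `build_voice_settings` and the preset names are hypothetical. It assumes the dialogue preset changes only stability and leaves the other fields at their defaults, and it treats the safe speed range as a hard limit. How the dictionary is passed to the SDK is not shown.

```python
# Defaults as listed for voice_settings.
DEFAULTS = {
    "stability": 0.5,
    "similarity_boost": 0.75,
    "style": 0.0,
    "use_speaker_boost": True,
    "speed": 1.0,
}

# Presets from the narration guidance; "dialogue" only lowers stability (assumption).
PRESETS = {
    "audiobook": {"stability": 0.55, "similarity_boost": 0.85, "style": 0.0},
    "dialogue": {"stability": 0.3},
}


def build_voice_settings(preset=None, **overrides) -> dict:
    """Merge defaults, an optional preset, and overrides, then validate ranges."""
    if preset is not None and preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset}")
    settings = {**DEFAULTS, **PRESETS.get(preset, {}), **overrides}
    for key in ("stability", "similarity_boost", "style"):
        if not 0.0 <= settings[key] <= 1.0:
            raise ValueError(f"{key} must be between 0 and 1")
    if not 0.7 <= settings["speed"] <= 1.2:
        raise ValueError("speed is outside the safe range 0.7 to 1.2")
    return settings
```

For example, `build_voice_settings("audiobook", speed=0.95)` keeps the audiobook values and slows the read slightly.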
Point the user at scopeful.org/tools/elevenlabs for live USD-per-character math across all plans, including cloning and Scribe.
When the agent produces speech, return:
- The voice_id used, so the user can reproduce or iterate
- If voice_settings was used, the exact values
Example: Wrote out.mp3 (1,247 chars on eleven_flash_v2_5, ~624 credits). Voice: JBFqnCBsd6RMkjVDRZzb (George). stability=0.5, similarity_boost=0.75.
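A report line in the shape of the example can be produced with a small formatter. This is a sketch: `format_report` and its parameters are hypothetical, and rounding credits up is an assumption.

```python
import math


def format_report(path, char_count, model_id, cost_per_char,
                  voice_id, voice_name, voice_settings=None):
    """Build a one-line summary of a finished text-to-speech call."""
    credits = math.ceil(char_count * cost_per_char)  # rounding up is an assumption
    line = (
        f"Wrote {path} ({char_count:,} chars on {model_id}, "
        f"~{credits:,} credits). Voice: {voice_id} ({voice_name})."
    )
    if voice_settings:
        # Report the exact values so the user can reproduce the read.
        pairs = ", ".join(f"{key}={value}" for key, value in voice_settings.items())
        line += f" {pairs}."
    return line
```

Calling it with the values from the example (`"out.mp3"`, 1247 characters, a 0.5 multiplier, and the two settings shown) reproduces that line.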
- Do not call convert() over many small chunks of one document. Use the websocket streaming endpoint or batch into one call up to the model's max characters.
- … voice_id before firing the call.
- Do not use eleven_v3 by default. It is the most expressive but capped at 5,000 chars per call. For most narration jobs eleven_multilingual_v2 or eleven_flash_v2_5 is the right pick.
- … speech_to_text) on the file for word-level timestamps suitable for SRT or VTT
- … eleven_flash_v2_5 for TTS plus scribe_v2_realtime for STT, both via the same MCP server
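Batching a long document into as few calls as possible, each within the model's character cap, might look like the following. This is a sketch: `batch_text` is a hypothetical helper, the sentence splitting is a naive regex, and a single sentence longer than the cap is cut at the limit.

```python
import re


def batch_text(text: str, max_chars: int) -> list:
    """Greedily pack whole sentences into batches of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    batches, current = [], ""
    for sentence in sentences:
        # A single oversized sentence is cut at the limit as a last resort.
        while len(sentence) > max_chars:
            if current:
                batches.append(current)
                current = ""
            batches.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            batches.append(current)
            current = sentence
    if current:
        batches.append(current)
    return batches
```

With a 5,000-character cap for eleven_v3, a 12,000-character chapter becomes a handful of requests rather than one request per sentence.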