Open-weight text-to-speech
Voxtral TTS is multilingual, expressive text-to-speech built for voice agents.
Voxtral TTS turns text into natural speech across 9 languages, clones voices zero-shot from just 2-3 seconds of reference audio, and serves both streaming and batch workloads with low latency.
Why Voxtral TTS
Open weights, expressive prosody, and deployment paths that fit real production constraints.
- Zero-shot voice cloning from short 2-3 second audio prompts
- Natural prosody, emotion transfer, and voice-as-instruction behavior
- 24 kHz output with streaming and batch inference for agent workflows
Live workspace
Voxtral TTS Demo
Try the official Hugging Face Space powered by Mistral's Voxtral TTS stack. First load can take a moment while the demo instance wakes up.
Demo environment
Official Voxtral TTS Space
If the interface appears blank at first, keep the tab open while the Hugging Face Space finishes its cold start.
Benefits
Why teams evaluate Voxtral TTS
Voxtral TTS combines open deployment options with speech quality that feels natural in user-facing products.
Natural speech and emotion
Generate realistic, expressive speech that preserves rhythm, intonation, and emotional color instead of flat robotic narration.
Fast voice adaptation
Clone and adapt voices from short prompts, making it easier to prototype branded assistants and personalized narrators.
Low-latency voice delivery
Use streaming and batch generation paths for voice agents, live experiences, and high-throughput speech pipelines.
Use cases
Where Voxtral TTS fits best
The model is positioned for production voice applications where quality, control, and deployment flexibility matter.
Customer Support
Build support assistants with natural phone-ready speech, voice consistency, and responsive streaming outputs.
Financial Services
Power KYC flows, onboarding agents, and transaction guidance with branded voices and auditable outputs.
Real-time Translation
Pair multilingual speech generation with translation pipelines to deliver fluent spoken experiences in multiple regions.
Automotive Systems
Deploy in-vehicle prompts, copilots, and infotainment voices that feel more human without proprietary lock-in.
Features
Core Voxtral TTS capabilities
Highlights from the official model card and docs, condensed into the capabilities most relevant to product and developer teams.
9-language support
English, French, Spanish, Portuguese, Italian, Dutch, German, Hindi, and Arabic with code-mixing and cross-lingual cloning support.
20 preset voices
Start from built-in voices for rapid experimentation before saving reusable profiles in production workflows.
2-3 second voice cloning
Use a short audio sample to transfer voice identity, style, accent, and emotional rendering without retraining.
Streaming and batch inference
Choose low-latency streaming for interactive systems or batch generation for larger content pipelines.
24 kHz audio output
Export speech in WAV, PCM, FLAC, MP3, AAC, and Opus formats for playback, post-processing, or API delivery.
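The 24 kHz sample rate maps directly onto standard audio tooling. As a minimal sketch (not Voxtral code), this writes a 24 kHz mono 16-bit PCM WAV with Python's stdlib, the kind of container you would typically use for raw model output; the 440 Hz tone is a placeholder standing in for generated speech:

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # Voxtral TTS output rate
DURATION_S = 0.5

# Placeholder tone standing in for model output.
samples = [
    int(32767 * 0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
    for n in range(int(SAMPLE_RATE * DURATION_S))
]

with wave.open("voxtral_out.wav", "wb") as wav:
    wav.setnchannels(1)        # mono
    wav.setsampwidth(2)        # 16-bit PCM
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
```

Compressed targets such as FLAC, MP3, AAC, or Opus would need an encoder on top of this; the WAV/PCM path shown here is the lossless baseline.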
Open weights with vLLM Omni
Serve locally through vLLM Omni and run the 4B model on a single GPU with at least 16 GB of memory.
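Once a server is up, generation is an HTTP request away. The sketch below only builds a candidate request body; the model id, field names, and any endpoint path are illustrative assumptions, not the official vLLM Omni or Voxtral API, so check the model card and docs for the real schema:

```python
import json

def build_speech_request(text: str, voice: str = "preset-1") -> str:
    """Assemble an illustrative JSON body for a self-hosted TTS request.

    All field names and values here are assumptions for illustration,
    not a documented Voxtral or vLLM Omni schema.
    """
    payload = {
        "model": "mistralai/Voxtral-TTS",  # assumed model id
        "input": text,
        "voice": voice,                    # one of the built-in preset voices
        "response_format": "wav",          # WAV/PCM/FLAC/MP3/AAC/Opus per docs
        "sample_rate": 24_000,
    }
    return json.dumps(payload)

body = build_speech_request("Hello from a self-hosted voice agent.")
```

The point of the sketch is the shape of the integration: text in, a voice selector, an output format, and a fixed 24 kHz sample rate out.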
FAQ
Voxtral TTS FAQ
Quick answers for teams exploring Voxtral TTS for production voice applications.
What license does Voxtral TTS use?
The official Hugging Face model card lists Voxtral TTS under a CC BY-NC 4.0 (non-commercial) license, because the bundled reference voices inherit that licensing. Review the model card before any commercial deployment.
Which languages does Voxtral TTS support?
Official docs list English, French, Spanish, Portuguese, Italian, Dutch, German, Hindi, and Arabic, with support for cross-lingual voice cloning.
How much reference audio do I need for voice cloning?
Mistral documents zero-shot voice cloning from as little as 2-3 seconds of audio, which is enough to transfer voice identity and style.
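Because cloning quality depends on the reference clip, a pre-flight length check is cheap insurance before sending a prompt. A stdlib-only sketch (the helper names are ours, and the 2-second floor mirrors the documented minimum, not an enforced API limit):

```python
import wave

def reference_duration_s(path: str) -> float:
    """Return the duration of a PCM WAV reference clip in seconds."""
    with wave.open(path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()

def check_clone_prompt(path: str, min_s: float = 2.0) -> float:
    """Reject clips shorter than the documented 2-3 s minimum."""
    duration = reference_duration_s(path)
    if duration < min_s:
        raise ValueError(f"reference clip is {duration:.2f}s; need >= {min_s}s")
    return duration
```

In practice you would also want to check the clip for silence or clipping, but duration is the documented floor.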
Does Voxtral TTS support streaming?
Yes. Mistral documents low-latency streaming with roughly 90 ms model processing time, alongside batch generation for offline workflows.
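The practical payoff of streaming is time-to-first-audio rather than total generation time. A minimal sketch of the consumer side, with a stand-in generator in place of a real Voxtral stream (both function names are ours):

```python
import time
from typing import Iterator

def fake_voxtral_stream(pcm: bytes, chunk_size: int = 4_800) -> Iterator[bytes]:
    """Stand-in for a streaming TTS response: yields PCM in chunks."""
    for i in range(0, len(pcm), chunk_size):
        yield pcm[i : i + chunk_size]

def play_streaming(stream: Iterator[bytes]) -> bytes:
    """Consume chunks as they arrive; playback can start on the first one."""
    start = time.monotonic()
    received = bytearray()
    for n, chunk in enumerate(stream):
        if n == 0:
            # With ~90 ms model processing time, audio can start here
            # long before the full utterance has been generated.
            print(f"first chunk after {time.monotonic() - start:.3f}s")
        received.extend(chunk)
    return bytes(received)

audio = play_streaming(fake_voxtral_stream(b"\x00" * 48_000))
```

Batch generation is the same call without the loop: wait for the complete buffer, which is simpler but trades away responsiveness.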
How do I self-host Voxtral TTS?
The model card recommends vLLM Omni for serving. The 4B model can run on a single GPU with at least 16 GB memory, depending on your deployment configuration.
Where can I try Voxtral TTS right now?
You can test the official Hugging Face Space embedded above, inspect the model card on Hugging Face, and read the API docs in Mistral's text-to-speech documentation.