Open-weight text-to-speech
Voxtral TTS is multilingual, expressive text-to-speech built for voice agents.
Voxtral TTS turns text into natural speech across 9 languages, clones voices zero-shot from just 2-3 seconds of reference audio, and serves both streaming and batch workloads with low latency.
Why Voxtral TTS
Open weights, expressive prosody, and deployment paths that fit real production constraints.
- Zero-shot voice cloning from short 2-3 second audio prompts
- Natural prosody, emotion transfer, and voice-as-instruction behavior
- 24 kHz output with streaming and batch inference for agent workflows
Live workspace
Voxtral TTS Demo
Try the official Hugging Face Space powered by Mistral's Voxtral TTS stack. First load can take a moment while the demo instance wakes up.
Demo environment
Official Voxtral TTS Space
If the interface appears blank at first, keep the tab open while the Hugging Face Space finishes its cold start.
Benefits
Why teams evaluate Voxtral TTS
Voxtral TTS combines open deployment options with speech quality that feels natural in user-facing products.
Natural speech and emotion
Generate realistic, expressive speech that preserves rhythm, intonation, and emotional color instead of flat robotic narration.
Fast voice adaptation
Clone and adapt voices from short prompts, making it easier to prototype branded assistants and personalized narrators.
Low-latency voice delivery
Use streaming and batch generation paths for voice agents, live experiences, and high-throughput speech pipelines.
Use cases
Where Voxtral TTS fits best
The model is positioned for production voice applications where quality, control, and deployment flexibility matter.
Customer Support
Build support assistants with natural phone-ready speech, voice consistency, and responsive streaming outputs.
Financial Services
Power KYC flows, onboarding agents, and transaction guidance with branded voices and auditable outputs.
Real-time Translation
Pair multilingual speech generation with translation pipelines to deliver fluent spoken experiences in multiple regions.
Automotive Systems
Deploy in-vehicle prompts, copilots, and infotainment voices that feel more human without proprietary lock-in.
Features
Core Voxtral TTS capabilities
Highlights from the official model card and docs, condensed into the capabilities most relevant to product and developer teams.
9-language support
English, French, Spanish, Portuguese, Italian, Dutch, German, Hindi, and Arabic with code-mixing and cross-lingual cloning support.
20 preset voices
Start from built-in voices for rapid experimentation before saving reusable profiles in production workflows.
2-3 second voice cloning
Use a short audio sample to transfer voice identity, style, accent, and emotional rendering without retraining.
Streaming and batch inference
Choose low-latency streaming for interactive systems or batch generation for larger content pipelines.
24 kHz audio output
Export speech in WAV, PCM, FLAC, MP3, AAC, and Opus formats for playback, post-processing, or API delivery.
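The 24 kHz sample rate maps directly onto standard audio tooling. As a minimal sketch (not Voxtral code), this writes a 24 kHz mono 16-bit PCM WAV with Python's stdlib, the kind of container you would typically use for raw model output; the 440 Hz tone is a placeholder standing in for generated speech:

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # Voxtral TTS output rate
DURATION_S = 0.5

# Placeholder tone standing in for model output.
samples = [
    int(32767 * 0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
    for n in range(int(SAMPLE_RATE * DURATION_S))
]

with wave.open("voxtral_out.wav", "wb") as wav:
    wav.setnchannels(1)        # mono
    wav.setsampwidth(2)        # 16-bit PCM
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
```

Compressed targets such as FLAC, MP3, AAC, or Opus would need an encoder on top of this; the WAV/PCM path shown here is the lossless baseline.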
Open weights with vLLM Omni
Serve locally through vLLM Omni and run the 4B model on a single GPU with at least 16 GB of memory.
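Once a server is up, generation is an HTTP request away. The sketch below only builds a candidate request body; the model id, field names, and any endpoint path are illustrative assumptions, not the official vLLM Omni or Voxtral API, so check the model card and docs for the real schema:

```python
import json

def build_speech_request(text: str, voice: str = "preset-1") -> str:
    """Assemble an illustrative JSON body for a self-hosted TTS request.

    All field names and values here are assumptions for illustration,
    not a documented Voxtral or vLLM Omni schema.
    """
    payload = {
        "model": "mistralai/Voxtral-TTS",  # assumed model id
        "input": text,
        "voice": voice,                    # one of the built-in preset voices
        "response_format": "wav",          # WAV/PCM/FLAC/MP3/AAC/Opus per docs
        "sample_rate": 24_000,
    }
    return json.dumps(payload)

body = build_speech_request("Hello from a self-hosted voice agent.")
```

The point of the sketch is the shape of the integration: text in, a voice selector, an output format, and a fixed 24 kHz sample rate out.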
FAQ
Voxtral TTS FAQ
Quick answers for teams exploring Voxtral TTS for production voice applications.
What license does Voxtral TTS use?
The official Hugging Face model card lists Voxtral TTS under a CC BY-NC 4.0 (non-commercial) license, because the bundled reference voices inherit that licensing. Review the model card before any commercial deployment.
Which languages does Voxtral TTS support?
Official docs list English, French, Spanish, Portuguese, Italian, Dutch, German, Hindi, and Arabic, with support for cross-lingual voice cloning.
How much reference audio do I need for voice cloning?
Mistral documents zero-shot voice cloning from as little as 2-3 seconds of audio, which is enough to transfer voice identity and style.
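Because cloning quality depends on the reference clip, a pre-flight length check is cheap insurance before sending a prompt. A stdlib-only sketch (the helper names are ours, and the 2-second floor mirrors the documented minimum, not an enforced API limit):

```python
import wave

def reference_duration_s(path: str) -> float:
    """Return the duration of a PCM WAV reference clip in seconds."""
    with wave.open(path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()

def check_clone_prompt(path: str, min_s: float = 2.0) -> float:
    """Reject clips shorter than the documented 2-3 s minimum."""
    duration = reference_duration_s(path)
    if duration < min_s:
        raise ValueError(f"reference clip is {duration:.2f}s; need >= {min_s}s")
    return duration
```

In practice you would also want to check the clip for silence or clipping, but duration is the documented floor.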
Does Voxtral TTS support streaming?
Yes. Mistral documents low-latency streaming with roughly 90 ms model processing time, alongside batch generation for offline workflows.
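The practical payoff of streaming is time-to-first-audio rather than total generation time. A minimal sketch of the consumer side, with a stand-in generator in place of a real Voxtral stream (both function names are ours):

```python
import time
from typing import Iterator

def fake_voxtral_stream(pcm: bytes, chunk_size: int = 4_800) -> Iterator[bytes]:
    """Stand-in for a streaming TTS response: yields PCM in chunks."""
    for i in range(0, len(pcm), chunk_size):
        yield pcm[i : i + chunk_size]

def play_streaming(stream: Iterator[bytes]) -> bytes:
    """Consume chunks as they arrive; playback can start on the first one."""
    start = time.monotonic()
    received = bytearray()
    for n, chunk in enumerate(stream):
        if n == 0:
            # With ~90 ms model processing time, audio can start here
            # long before the full utterance has been generated.
            print(f"first chunk after {time.monotonic() - start:.3f}s")
        received.extend(chunk)
    return bytes(received)

audio = play_streaming(fake_voxtral_stream(b"\x00" * 48_000))
```

Batch generation is the same call without the loop: wait for the complete buffer, which is simpler but trades away responsiveness.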
How do I self-host Voxtral TTS?
The model card recommends vLLM Omni for serving. The 4B model can run on a single GPU with at least 16 GB memory, depending on your deployment configuration.
Where can I try Voxtral TTS right now?
You can test the official Hugging Face Space embedded above, inspect the model card on Hugging Face, and read the API docs in Mistral's text-to-speech documentation.