A text-to-speech model that skips audio tokens

Most modern TTS systems convert speech into discrete tokens, model those tokens, then decode back to audio. VoxCPM, from OpenBMB (the MiniCPM team), takes the less common path: it is tokenizer-free, generating continuous speech representations directly through an end-to-end diffusion autoregressive architecture. The claim behind that design is more natural, expressive output, because nothing is quantized into a discrete vocabulary on the way through.

The current release, VoxCPM2, is a 2B-parameter model built on a MiniCPM-4 backbone and trained on over two million hours of multilingual speech. It covers 30 languages and outputs 48 kHz audio, which the project bills as studio quality. It is Apache-2.0 licensed and explicitly cleared for commercial use, which is the practical detail that separates it from several capable but restrictively licensed peers.

The feature that stands out: voice design

Cloning a voice from a reference clip is table stakes now, and VoxCPM offers two grades of it. Controllable cloning works from a short clip, with optional style guidance to steer emotion and pace while preserving the original timbre. An ultimate-fidelity mode takes both the reference audio and its transcript, so the model continues seamlessly from the reference and holds onto its vocal nuances. The model also infers appropriate prosody from the text content itself, so a flat input does not force a flat read.

The more unusual capability is Voice Design: you describe a voice in natural language (gender, age, tone, emotion, pace) and the model synthesizes a brand-new voice from that description alone, with no reference audio at all. That is a genuinely different workflow from clone-only systems, useful when you want a consistent original voice rather than someone else’s.
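
If a consistent original voice is the goal, it helps to keep the description in one place instead of retyping it for every call. The sketch below is only that: `VoiceSpec` and `describe` are hypothetical names invented for this example, not part of the voxcpm package, and how the resulting sentence is handed to the model is deliberately left out because it depends on the release you install.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceSpec:
    """Hypothetical container for the attributes a voice description covers."""

    gender: str
    age: str
    tone: str
    emotion: str
    pace: str

    def describe(self) -> str:
        # Render the attributes as one natural-language sentence.
        return (
            f"A {self.age} {self.gender} voice with a {self.tone} tone, "
            f"sounding {self.emotion}, speaking at a {self.pace} pace."
        )


# One frozen spec, reused everywhere, keeps the designed voice consistent.
narrator = VoiceSpec(
    gender="female",
    age="middle-aged",
    tone="warm",
    emotion="calm",
    pace="measured",
)
description = narrator.describe()
print(description)
```

Freezing the dataclass means the spec cannot drift between calls, which is the property that matters when the description is the only thing defining the voice.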

Install

VoxCPM is a pip package:

pip install voxcpm

The system requirements are the part to read before you commit: Python 3.10 up to but not including 3.13, PyTorch 2.5.0 or newer, and CUDA 12.0 or newer. This is a GPU-first model.
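
Those three version bounds are easy to check before downloading any weights. The following is a minimal sketch, not something shipped with voxcpm: the helper names (`parse_version`, `python_ok`, `torch_ok`, `cuda_ok`, `check_environment`) are made up for this example, and it assumes the bounds are exactly the ones quoted above.

```python
import re
import sys


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.5.0+cu121' into (2, 5, 0); drop any local build suffix."""
    numbers = [int(n) for n in re.findall(r"\d+", text.split("+")[0])[:3]]
    # Pad short versions such as '12.1' so tuple comparison behaves.
    return tuple(numbers + [0] * (3 - len(numbers)))


def python_ok(version) -> bool:
    """Python 3.10 up to but not including 3.13."""
    return (3, 10) <= tuple(version[:2]) < (3, 13)


def torch_ok(torch_version: str) -> bool:
    """PyTorch 2.5.0 or newer."""
    return parse_version(torch_version) >= (2, 5, 0)


def cuda_ok(cuda_version) -> bool:
    """CUDA 12.0 or newer; torch.version.cuda is None on CPU-only builds."""
    return cuda_version is not None and parse_version(cuda_version) >= (12, 0, 0)


def check_environment() -> dict[str, bool]:
    """Report which requirements the current interpreter satisfies."""
    report = {"python": python_ok(sys.version_info[:2])}
    try:
        import torch
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = torch_ok(torch.__version__)
    report["cuda"] = cuda_ok(torch.version.cuda) and torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'needs attention'}")
```

The CUDA check looks at both the toolkit version PyTorch was built against and whether a device is actually visible, since either one failing leaves you without a GPU path.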

A minimal synthesis call:

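from voxcpm import VoxCPM
import soundfile as sf

# Download (or load from cache) the VoxCPM2 weights; skip the optional denoiser.
model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)
wav = model.generate(
    text="VoxCPM2 generates realistic multilingual speech.",
    cfg_value=2.0,           # guidance scale
    inference_timesteps=10,  # number of diffusion sampling steps
)
# Write the waveform at the model's own sample rate.
sf.write("demo.wav", wav, model.tts_model.sample_rate)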

Weights also pull from ModelScope if Hugging Face is slow for you. For serving, the project documents real-time factors (synthesis time divided by the duration of the audio produced, so lower is faster) around 0.3 on an RTX 4090, dropping to about 0.13 when accelerated through nano-vLLM or vLLM-Omni. If the 2B model is heavier than your hardware allows, the lineage also includes smaller earlier releases, a 0.5B and a 1.5B, so you can trade some quality for a lighter footprint.
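
To make those real-time factors concrete: RTF is synthesis time divided by the duration of the audio produced, so at 0.3 one minute of speech takes about 18 seconds to generate, and at 0.13 about 7.8 seconds. The sketch below does that arithmetic and shows one way to measure RTF on your own hardware. All function names are illustrative, and `synthesize` is a hypothetical zero-argument callable that returns a one-dimensional sequence of samples (for instance, a wrapper around the `model.generate` call above).

```python
import time


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """How long generating `audio_seconds` of speech takes at a given RTF."""
    return audio_seconds * rtf


def measure_rtf(synthesize, sample_rate: int) -> float:
    """Time one call to `synthesize` and relate it to the audio length."""
    start = time.perf_counter()
    samples = synthesize()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


# The two figures the project documents, applied to one minute of audio.
for label, rtf in [("RTX 4090", 0.3), ("accelerated", 0.13)]:
    seconds = estimated_synthesis_seconds(60.0, rtf)
    print(f"{label}: ~{seconds:.1f} s to synthesize one minute of audio")
```

Anything below 1.0 is faster than real time; a single measured call includes warm-up cost, so averaging several runs gives a fairer number.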

What the issue tracker reveals

The README leads with capability. The 121 open issues, as of June 2026, fill in the operational reality:

  • CPU is not a smooth path. A reported issue is the CLI rejecting --device cpu as an unrecognized argument, and other threads cover CUDA errors. Combined with the CUDA 12 requirement, plan on an NVIDIA GPU; do not expect to demo this comfortably on a laptop CPU.
  • Per-language artifacts exist. The single most-discussed open issue is voice cutoff at the end of words in Polish. Broad language coverage does not mean every language is equally polished, so test the specific languages you ship.
  • Dependency friction is real. Threads about torchcodec point to the kind of audio-stack setup pain common to TTS projects. Budget time for the environment, not just the model.

These are the normal seams of a fast-moving research release (VoxCPM2 landed in April 2026, with point releases since), not signs of abandonment.
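
One way to act on the advice to test each language you ship is a crude automated check for clips that stop while the speaker is still audible. What follows is a heuristic sketch only: `ends_abruptly`, the 30 ms tail window, and the 0.25 ratio are assumptions chosen for illustration, not values from the VoxCPM project, and a check like this will not catch every kind of cutoff. It expects a plain sequence of float samples.

```python
import math


def rms(samples) -> float:
    """Root-mean-square level of a sequence of float samples."""
    if len(samples) == 0:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def ends_abruptly(samples, sample_rate: int,
                  tail_ms: float = 30.0, ratio: float = 0.25) -> bool:
    """Flag audio whose final `tail_ms` is still loud relative to the clip."""
    tail_len = max(1, int(sample_rate * tail_ms / 1000))
    overall = rms(samples)
    if overall == 0.0:
        return False  # pure silence cannot be cut off
    return rms(samples[-tail_len:]) > ratio * overall


# Synthetic demo: a 220 Hz tone that stops dead versus one followed by silence.
rate = 16000
tone = [math.sin(2 * math.pi * 220 * n / rate) for n in range(rate)]
padded = tone + [0.0] * (rate // 10)

print("tone ends abruptly:", ends_abruptly(tone, rate))
print("padded ends abruptly:", ends_abruptly(padded, rate))
```

Run over a batch of test sentences per language, a flag rate that is much higher for one language than the others is a reasonable cue to listen to those clips by hand.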

VoxCPM versus other open TTS models

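| | VoxCPM | fish-speech | ChatTTS | F5-TTS |
| --- | --- | --- | --- | --- |
| Stars | 28,133 | 30,734 | 39,422 | 14,710 |
| Approach | tokenizer-free diffusion AR | token-based | token-based | flow matching |
| Voice design from text | yes | no | no | no |
| License | Apache-2.0 | non-standard | AGPL-3.0 | MIT |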

Counts are from GitHub as of June 2026. fish-speech and ChatTTS are popular token-based systems, but note their licenses: ChatTTS is AGPL-3.0 and fish-speech ships a non-standard license, both of which constrain commercial use more than VoxCPM’s clean Apache-2.0. F5-TTS is MIT and strong, but does not offer text-prompt voice design. VoxCPM’s combination of a permissive license, voice design, and the tokenizer-free architecture is its specific niche.

VoxCPM pairs naturally with a local language model from Ollama to build a fully self-hosted voice pipeline.
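
The shape of such a pipeline is simple enough to sketch without committing to any particular API. In the example below, `generate_text` and `synthesize` are hypothetical callables you would supply yourself: the first might wrap a request to a local Ollama model, the second a call to `model.generate`. Splitting the reply into sentences is a design choice assumed here so that speech can begin before the whole reply has been voiced.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Split after sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def voice_pipeline(
    prompt: str,
    generate_text: Callable[[str], str],
    synthesize: Callable[[str], object],
) -> list:
    """Ask the language model for a reply, then voice it sentence by sentence."""
    reply = generate_text(prompt)
    return [synthesize(sentence) for sentence in split_sentences(reply)]


# Stand-ins so the sketch runs without a model or a GPU.
def fake_llm(prompt: str) -> str:
    return "Hello there. How are you? Fine!"


def fake_tts(sentence: str) -> str:
    return f"<audio for: {sentence}>"


for clip in voice_pipeline("greet me", fake_llm, fake_tts):
    print(clip)
```

Passing the two stages in as callables keeps the pipeline testable on a machine with no GPU, which matters given how GPU-bound the real synthesis step is.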

FAQ

What does “tokenizer-free” mean here? VoxCPM generates continuous speech representations directly rather than predicting discrete audio tokens, which its authors argue yields more natural output.

Can I run it on CPU? Not comfortably. It requires CUDA 12 and PyTorch 2.5+, and a reported issue shows the CLI rejecting --device cpu. Plan on an NVIDIA GPU.

Is it safe for commercial use? Yes, weights and code are Apache-2.0 licensed, which is more permissive than several alternatives.

What is voice design? Creating a new voice from a natural-language description (gender, age, tone, pace) with no reference audio, distinct from cloning an existing voice.