VoxCPM: Studio-Quality Voice Synthesis You Can Run Locally
April 16, 2026

Most text-to-speech tools are fine until you actually need them. ElevenLabs sounds great but charges per character. Google Cloud TTS is reliable but requires an API key, a billing account, and the quiet acceptance that every sentence you synthesize leaves your machine. OpenAI TTS is good and reasonably priced, but you're still sending your text to someone else's server.
If you're building something where that matters -- game dialogue, offline tools, privacy-sensitive applications, or just a project where you'd rather not think about API bills -- the open-source alternatives used to be pretty rough. Robotic voices, weird prosody, models that required a week of fine-tuning before they sounded usable.
VoxCPM changed that.
What VoxCPM actually is
VoxCPM is a 2-billion parameter text-to-speech model from OpenBMB. It generates natural, expressive speech directly from text, and it supports 30+ languages out of the box. The audio comes out at 48kHz.
The part that's genuinely interesting: you control the voice with a text description. No reference audio, no voice cloning, no separate voice encoder. You just prepend a description in parentheses and the model figures it out:
(Young woman, warm and thoughtful)The server has been down for six hours.
That's it. The model synthesizes a voice that matches the description. You can specify age, gender, tone, energy level -- whatever you want to describe. Some combinations work better than others, but the flexibility is real.
The model runs on CUDA (NVIDIA GPUs), Apple Silicon via MPS, or CPU as a fallback. On a GPU, you're looking at 2-5 seconds per generation. On CPU, 15-30 seconds. Minimum hardware is 8GB RAM and either 8GB VRAM (NVIDIA) or 6GB unified memory (Apple Silicon M1+).
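That fallback order is easy to sketch. Here's a minimal selection helper in plain Python; in the real server this would presumably consult torch.cuda.is_available() and torch.backends.mps.is_available(), which is an assumption about the implementation:

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Pick the fastest available backend: CUDA first, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# On an Apple Silicon machine with no NVIDIA GPU:
print(pick_device(cuda_available=False, mps_available=True))  # mps
```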
VoxCPM is open-source and free. No licensing, no API keys. The weights are about 4GB, downloaded automatically from HuggingFace on first run.
More at https://voxcpm.com/en/.
The text-to-voice API wrapper
Running VoxCPM directly works, but wrapping it in a REST API makes it useful for actual applications. I built prafullsalunke/text-to-voice for exactly this -- a lightweight FastAPI server that handles the model lifecycle and exposes a clean HTTP interface.
The server loads the model once at startup using FastAPI's lifespan context manager, keeps it in memory, and handles requests as they come in. There's no model reload per request: the ~4GB of weights stay resident, so each call responds in seconds instead of paying the full load cost every time.
Two endpoints:
GET /health -- model status, device type, VRAM usage
POST /synthesize -- converts text to WAV audio
Installation
Install VoxCPM via pipx (recommended for CLI use):
pipx install voxcpm
Or in a virtual environment:
pip install voxcpm
Clone the API server and install its dependencies:
git clone https://github.com/prafullsalunke/text-to-voice
cd text-to-voice
pip install fastapi uvicorn pydantic-settings soundfile numpy
Start the server:
uvicorn main:app --port 8000
The first run downloads the model weights (~4GB). After that, startup takes 10-30 seconds while the model loads into memory.
Basic usage
Send a POST request with your text:
curl -X POST http://localhost:8000/synthesize \
-H "Content-Type: application/json" \
-d '{"text": "The deployment pipeline finished without errors."}' \
--output output.wav
Add a voice description to control how it sounds:
curl -X POST http://localhost:8000/synthesize \
-H "Content-Type: application/json" \
-d '{
"text": "The deployment pipeline finished without errors.",
"voice_description": "Middle-aged man, calm and professional"
}' \
--output output.wav
From JavaScript:
const res = await fetch('http://localhost:8000/synthesize', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
text: 'Hello from the other side of the API.',
voice_description: 'Young woman, energetic',
}),
});
const blob = await res.blob();
new Audio(URL.createObjectURL(blob)).play();
Voice design parameters
Three parameters control synthesis quality and voice style:
voice_description -- Natural language description of the voice. The model prepends it to the text in (description)text format before synthesis. You can be as specific or vague as you like. "Narrator" works. "Elderly Japanese man, gravelly voice" also works.
cfg_value -- Controls how strictly the model adheres to the description. Default is 2.0. Higher values (3-5) stick closer to the description but can sound less natural. Lower values give the model more latitude.
inference_timesteps -- Quality vs. speed tradeoff. Default is 10. Going higher improves audio quality at the cost of generation time. For production, 10-20 is usually enough.
Configuration
The server reads from environment variables or a .env file:
MODEL_ID=openbmb/VoxCPM2 # or openbmb/VoxCPM1.5 for a lighter model
PORT=8000
MAX_TEXT_LENGTH=500 # characters per request
The lighter VoxCPM1.5 model is an option if you're memory-constrained. Quality is lower but it fits on hardware with less VRAM.
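The repo uses pydantic-settings for this; an equivalent stdlib-only sketch of the same env-with-defaults behavior looks like:

```python
import os

def load_settings(env=os.environ) -> dict:
    """Read server settings from the environment, falling back to the documented defaults."""
    return {
        "model_id": env.get("MODEL_ID", "openbmb/VoxCPM2"),
        "port": int(env.get("PORT", "8000")),
        "max_text_length": int(env.get("MAX_TEXT_LENGTH", "500")),
    }

print(load_settings({}))  # all defaults
```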
Exposing the server remotely
If you want to call the API from a remote frontend or a different machine, Cloudflare Tunnel is the easiest path. No firewall rules, no port forwarding:
# Quick tunnel (temporary URL, no account needed)
cloudflared tunnel --url http://localhost:8000
# Named tunnel (persistent URL, requires Cloudflare account)
cloudflared tunnel login
cloudflared tunnel create text-to-voice
cloudflared tunnel run --url http://localhost:8000 text-to-voice
The server has CORS configured for localhost by default. Add your own origin in main.py if you need to.
Health check
Before synthesizing, check if the model loaded successfully:
curl http://localhost:8000/health
Response:
{
"status": "ready",
"device": "mps",
"vram_used_mb": 4096
}
If status is "loading", the model is still initializing. The server returns a 503 during that window.
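A client can poll /health until the model finishes loading. A transport-agnostic sketch: pass in any callable that returns the status string, for example one wrapping requests.get:

```python
import time

def wait_until_ready(get_status, timeout=60.0, interval=2.0) -> bool:
    """Poll a status callable until it returns 'ready' or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() == "ready":
            return True
        time.sleep(interval)
    return False
```

With requests, get_status could be lambda: requests.get(base_url + "/health").json()["status"], wrapped in a try/except for connection errors while the server is still booting.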
When this is and isn't the right tool
VoxCPM is good for:
- Offline voice synthesis where data can't leave the machine
- Projects where per-character API billing doesn't make sense
- Game or interactive media dialogue where you need many voices without a reference recording for each
- Prototyping voice features before deciding whether to pay for a commercial API
It's probably not the right call for:
- Low-latency real-time speech (CPU fallback is slow, and GPU generation is 2-5 seconds)
- Very long texts (the 500-character default limit exists for a reason -- synthesis quality degrades and generation time increases with longer inputs)
- Voices requiring extreme consistency across many outputs (descriptions are interpreted stochastically, so the same description can produce slight variations)
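If you do need longer passages, the usual workaround is to split the text at sentence boundaries into chunks under the limit and synthesize each chunk separately. A sketch of that split (500 matches the server's default MAX_TEXT_LENGTH; the splitting strategy itself is my own, not something the repo ships):

```python
import re

def split_text(text: str, limit: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks no longer than `limit` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence[:limit]  # hard-cut a single oversized sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then goes through /synthesize on its own, and the resulting WAV files get concatenated. Note the earlier caveat about consistency still applies: chunks synthesized separately may drift slightly in voice.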
The full model documentation and voice samples are at https://voxcpm.com/en/.
What's under the hood
The VoxCPM2 architecture is tokenizer-free -- it generates continuous speech waveforms directly from text rather than going through an intermediate phoneme or token representation. That's part of why the voice description trick works: the model processes the full (description)text string together and infers how to shape the speech from the combined context.
The denoiser is disabled by default (load_denoiser=False). This keeps VRAM usage down significantly (the model alone is ~4GB; the denoiser would push that higher), and for most applications the output quality without it is fine.
If you're curious about the implementation details, the test suite in the repo covers all the edge cases -- device detection, synthesis failure handling, VRAM tracking, parameter forwarding. Running it requires no GPU and no model download since all VoxCPM dependencies are mocked in the fixtures.


