
VoxCPM: Studio-Quality Voice Synthesis You Can Run Locally

April 16, 2026

Most text-to-speech tools are fine until you actually need them. ElevenLabs sounds great but charges per character. Google Cloud TTS is reliable but requires an API key, a billing account, and the quiet acceptance that every sentence you synthesize leaves your machine. OpenAI TTS is good and reasonably priced, but you're still sending your text to someone else's server.

If you're building something where that matters -- game dialogue, offline tools, privacy-sensitive applications, or just a project where you'd rather not think about API bills -- the open-source alternatives used to be pretty rough. Robotic voices, weird prosody, models that required a week of fine-tuning before they sounded usable.

VoxCPM changed that.

What VoxCPM actually is

VoxCPM is a 2-billion-parameter text-to-speech model from OpenBMB. It generates natural, expressive speech directly from text, and it supports 30+ languages out of the box. The audio comes out at 48kHz.

The part that's genuinely interesting: you control the voice with a text description. No reference audio, no voice cloning, no separate voice encoder. You just prepend a description in parentheses and the model figures it out:

(Young woman, warm and thoughtful)The server has been down for six hours.

That's it. The model synthesizes a voice that matches the description. You can specify age, gender, tone, energy level -- whatever you want to describe. Some combinations work better than others, but the flexibility is real.

The model runs on CUDA (NVIDIA GPUs), Apple Silicon via MPS, or CPU as a fallback. On a GPU, you're looking at 2-5 seconds per generation. On CPU, 15-30 seconds. Minimum hardware is 8GB RAM and either 8GB VRAM (NVIDIA) or 6GB unified memory (Apple Silicon M1+).

VoxCPM is open-source and free. No licensing fees, no API keys. The weights are about 4GB, downloaded automatically from HuggingFace on first run.

More at https://voxcpm.com/en/.

The text-to-voice API wrapper

Running VoxCPM directly works, but wrapping it in a REST API makes it useful for actual applications. I built prafullsalunke/text-to-voice for exactly this -- a lightweight FastAPI server that handles the model lifecycle and exposes a clean HTTP interface.

The server loads the model once at startup using FastAPI's lifespan context manager, keeps it in memory, and handles requests as they come in. No model reload per request. The ~4GB of weights stay resident, so the server responds in seconds rather than loading from scratch every time.
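The load-once pattern can be sketched with a plain asynccontextmanager; the names here are illustrative stand-ins, not the repo's actual code:

```python
import asyncio
from contextlib import asynccontextmanager

MODEL = {}  # module-level cache shared by all request handlers

def load_model():
    # stand-in for the real VoxCPM load (which downloads ~4GB on first run)
    return "voxcpm-model-object"

@asynccontextmanager
async def lifespan(app):
    MODEL["tts"] = load_model()  # runs once at startup
    yield                        # the server handles requests while suspended here
    MODEL.clear()                # model released at shutdown

async def demo():
    async with lifespan(app=None):
        assert "tts" in MODEL    # available to every request, no reload
    assert not MODEL             # cleaned up after shutdown

asyncio.run(demo())
```

In the real server the same generator is passed as `FastAPI(lifespan=...)`, which is what ties model loading to process startup instead of to individual requests.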

Two endpoints:

  • GET /health -- model status, device type, VRAM usage
  • POST /synthesize -- converts text to WAV audio

Installation

Install VoxCPM via pipx (recommended for CLI use):

pipx install voxcpm

Or in a virtual environment:

pip install voxcpm

Clone the API server and install its dependencies:

git clone https://github.com/prafullsalunke/text-to-voice
cd text-to-voice
pip install fastapi uvicorn pydantic-settings soundfile numpy

Start the server:

uvicorn main:app --port 8000

The first run downloads the model weights (~4GB). After that, startup takes 10-30 seconds while the model loads into memory.

Basic usage

Send a POST request with your text:

curl -X POST http://localhost:8000/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "The deployment pipeline finished without errors."}' \
  --output output.wav

Add a voice description to control how it sounds:

curl -X POST http://localhost:8000/synthesize \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The deployment pipeline finished without errors.",
    "voice_description": "Middle-aged man, calm and professional"
  }' \
  --output output.wav

From JavaScript:

const res = await fetch('http://localhost:8000/synthesize', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    text: 'Hello from the other side of the API.',
    voice_description: 'Young woman, energetic',
  }),
});
const blob = await res.blob();
new Audio(URL.createObjectURL(blob)).play();
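A Python client needs nothing beyond the standard library. This sketch builds the same request the curl examples send; the actual send is commented out since it assumes the server is running:

```python
import json
import urllib.request

def build_synthesize_request(text, voice_description=None,
                             base_url="http://localhost:8000"):
    """Build a POST /synthesize request matching the server's JSON schema."""
    payload = {"text": text}
    if voice_description:
        payload["voice_description"] = voice_description
    return urllib.request.Request(
        f"{base_url}/synthesize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_synthesize_request("Hello from Python.", "Young woman, energetic")
# with urllib.request.urlopen(req) as resp:
#     open("output.wav", "wb").write(resp.read())
```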

Voice design parameters

Three parameters control synthesis quality and voice style:

voice_description -- Natural language description of the voice. The server prepends it to the text in (description)text format before synthesis. You can be as specific or vague as you like. "Narrator" works. "Elderly Japanese man, gravelly voice" also works.

cfg_value -- Controls how strictly the model adheres to the description. Default is 2.0. Higher values (3-5) stick closer to the description but can sound less natural. Lower values give the model more latitude.

inference_timesteps -- Quality vs. speed tradeoff. Default is 10. Going higher improves audio quality at the cost of generation time. For production, 10-20 is usually enough.
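The three knobs map naturally onto a request payload with the documented defaults. A hypothetical helper that clamps them to sane ranges might look like this (the clamp bounds are illustrative choices, not values enforced by the server):

```python
def synthesis_params(voice_description=None, cfg_value=2.0,
                     inference_timesteps=10):
    """Assemble synthesis parameters with the documented defaults,
    clamped to illustrative ranges (not the server's actual limits)."""
    params = {
        # 3-5 = strict adherence to the description, lower = more latitude
        "cfg_value": min(max(cfg_value, 1.0), 5.0),
        # higher = better audio quality, slower generation
        "inference_timesteps": min(max(inference_timesteps, 4), 30),
    }
    if voice_description:
        params["voice_description"] = voice_description
    return params

print(synthesis_params("Narrator", cfg_value=7.0))
```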

Configuration

The server reads from environment variables or a .env file:

MODEL_ID=openbmb/VoxCPM2       # or openbmb/VoxCPM1.5 for a lighter model
PORT=8000
MAX_TEXT_LENGTH=500             # characters per request

The lighter VoxCPM1.5 model is an option if you're memory-constrained. Quality is lower but it fits on hardware with less VRAM.
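The same three settings can be read with nothing but the standard library. A minimal sketch of the idea (the server itself uses pydantic-settings, so this is an approximation, not its actual code):

```python
import os

def load_settings(env=os.environ):
    """Read server configuration with the documented defaults."""
    return {
        "model_id": env.get("MODEL_ID", "openbmb/VoxCPM2"),
        "port": int(env.get("PORT", "8000")),
        "max_text_length": int(env.get("MAX_TEXT_LENGTH", "500")),
    }

print(load_settings({"MODEL_ID": "openbmb/VoxCPM1.5"}))
```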

Exposing the server remotely

If you want to call the API from a remote frontend or a different machine, Cloudflare Tunnel is the easiest path. No firewall rules, no port forwarding:

# Quick tunnel (temporary URL, no account needed)
cloudflared tunnel --url http://localhost:8000

# Named tunnel (persistent URL, requires Cloudflare account)
cloudflared tunnel login
cloudflared tunnel create text-to-voice
cloudflared tunnel run --url http://localhost:8000 text-to-voice

The server has CORS configured for localhost by default. Add your own origin in main.py if you need to.

Health check

Before synthesizing, check if the model loaded successfully:

curl http://localhost:8000/health

Response:

{
  "status": "ready",
  "device": "mps",
  "vram_used_mb": 4096
}

If status is "loading", the model is still initializing. The server returns a 503 during that window.
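Clients can poll /health and wait until status flips from "loading" to "ready". Here is a sketch with an injectable fetch function, so the retry logic is testable without a server; the function name and signature are hypothetical:

```python
import time

def wait_until_ready(fetch_health, timeout=60.0, interval=2.0, sleep=time.sleep):
    """Poll until the model reports ready, or give up after `timeout` seconds.

    `fetch_health` should return the /health JSON as a dict, or raise
    OSError while the server is still coming up (e.g. on a 503)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if fetch_health().get("status") == "ready":
                return True
        except OSError:
            pass  # server not up yet, or a transient 503
        sleep(interval)
    return False

# e.g. wait_until_ready(lambda: json.load(urllib.request.urlopen(
#          "http://localhost:8000/health")))
```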

When this is and isn't the right tool

VoxCPM is good for:

  • Offline voice synthesis where data can't leave the machine
  • Projects where per-character API billing doesn't make sense
  • Game or interactive media dialogue where you need many voices without a reference recording for each
  • Prototyping voice features before deciding whether to pay for a commercial API

It's probably not the right call for:

  • Low-latency real-time speech (CPU fallback is slow, and GPU generation is 2-5 seconds)
  • Very long texts (the 500-character default limit exists for a reason -- synthesis quality degrades and generation time increases with longer inputs)
  • Voices requiring extreme consistency across many outputs (descriptions are interpreted stochastically, so the same description can produce slight variations)
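The length limit is workable for longer documents, though: split on sentence boundaries, synthesize each chunk, and concatenate the resulting WAVs. A hypothetical splitter:

```python
import re

def split_for_synthesis(text, max_len=500):
    """Split text into chunks under max_len, breaking at sentence boundaries.
    A single sentence longer than max_len is kept whole rather than cut."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_len:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks

print(split_for_synthesis("First sentence. Second sentence. Third.", max_len=20))
# → ['First sentence.', 'Second sentence.', 'Third.']
```

Note that chunk boundaries can change prosody slightly between chunks, which ties into the consistency caveat above.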

The full model documentation and voice samples are at https://voxcpm.com/en/.

What's under the hood

The VoxCPM2 architecture is tokenizer-free -- it generates continuous speech waveforms directly from text rather than going through an intermediate phoneme or token representation. That's part of why the voice description trick works: the model processes the full (description)text string together and infers how to shape the speech from the combined context.

The denoiser is skipped (load_denoiser=False) by default. This reduces VRAM usage significantly (the model alone is ~4GB; the denoiser would push that higher). For most applications the output quality without the denoiser is fine.

If you're curious about the implementation details, the test suite in the repo covers all the edge cases -- device detection, synthesis failure handling, VRAM tracking, parameter forwarding. Running it requires no GPU and no model download since all VoxCPM dependencies are mocked in the fixtures.
