TL;DR
- Architecture choice creates a hard ceiling on what you can achieve for a new language. No amount of data or training tricks will overcome a fundamentally limited architecture.
- Frozen codecs trained on English-dominant data are the biggest bottleneck. If the codec’s codebook doesn’t have entries for your target phonemes, those sounds simply cannot be generated.
- Continuous representations (mel-spectrograms) are language-agnostic by nature. They sidestep the codec bottleneck entirely.
- Phoneme vs. grapheme choice has real consequences, especially for languages with poor grapheme-to-phoneme tools. Bad G2P corrupts your data before training even starts.
- Multi-stage fine-tuning is the practical path for low-resource language adaptation: standard language first, then dialect, then target speaker.
Introduction
English TTS is basically solved at this point. There are multiple open-source models that sound genuinely human, and the research community has largely moved on to finer problems like emotion control and on-device efficiency.
But step outside English, and things break fast.
If you’ve read my earlier posts on Arabic ASR dataset curation and speech augmentation, you already know the pattern. Arabic has pharyngeals (ع, ح), uvulars (ق, غ, خ), and emphatics (ط, ض, ص, ظ), and each produces spectral patterns completely absent from English. An ASR model just needs to recognize these sounds. A TTS model needs to produce them. If the model’s internal representation can’t encode those sounds, nothing downstream will fix it. You can’t generate what you can’t represent.
I’ve spent the past few weeks researching TTS architectures specifically through the lens of Egyptian Arabic adaptation: testing models, reading papers, digging into codebases, and cataloging every open-source option I could find. This post is what I wish I’d had when I started: a practical breakdown of what architectural choices actually mean for anyone trying to make a model speak a language it’s never heard.
The Three Flavors of TTS Architecture
Modern open-source TTS models fall into three families. Each handles new languages very differently, and understanding these differences would have saved me a lot of wasted compute.
Codec Language Models (Autoregressive, Discrete Tokens)
Audio gets discretized into tokens by a pre-trained audio codec, a frozen encoder that maps audio to a fixed vocabulary of discrete codes. An LLM backbone then predicts these tokens autoregressively, one after another. This is where I started, because the appeal is hard to resist: you get access to the full LLM fine-tuning ecosystem. LoRA, quantization, standard training recipes, all of it works. Adaptation is fast and familiar if you’ve ever fine-tuned a language model.
The catch: quality is permanently capped by what the frozen codec can represent. The codec was trained on its own dataset (usually English-heavy) and learned a fixed codebook of audio “atoms.” If your target language has sounds that weren’t well represented in that data, those sounds have no proper codebook entry. The LLM can only predict token IDs that exist.
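To make the moving parts concrete, here's a minimal sketch of that control flow with stand-in modules rather than any real codec or LLM (the codec decoder is a placeholder, the "LM" is a toy GRU). The only point it makes is that generation happens entirely in the space of codebook IDs:

```python
import torch
import torch.nn as nn

CODEBOOK_SIZE = 1024   # the frozen codec's fixed vocabulary of audio "atoms"

# Placeholder for the frozen codec decoder: token IDs in, waveform out.
def codec_decode(ids):
    return torch.randn(len(ids) * 320)   # stand-in; a real decoder is a neural net

class TinyTokenLM(nn.Module):
    """Toy autoregressive model over (text + audio) token IDs."""
    def __init__(self, vocab=CODEBOOK_SIZE, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        h, _ = self.rnn(self.emb(ids))
        return self.head(h)              # logits over codebook IDs only

lm = TinyTokenLM()
ids = torch.randint(0, CODEBOOK_SIZE, (1, 8))        # conditioning tokens (prompt)
for _ in range(50):                                   # generate audio tokens one by one
    logits = lm(ids)[:, -1]
    next_id = torch.distributions.Categorical(logits=logits).sample()
    ids = torch.cat([ids, next_id[:, None]], dim=1)

# The frozen decoder turns IDs back into audio. If no ID (or combination of IDs)
# decodes to the phoneme you need, nothing the LM predicts can produce it.
wav = codec_decode(ids[0].tolist())
```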
Flow-Matching Models (Non-Autoregressive, Continuous)
These models generate entire mel-spectrograms at once using learned vector fields that transform noise into speech. They operate in continuous space with no discrete tokenization anywhere. A separate vocoder converts the mel-spectrogram to a waveform at inference, but that vocoder is just synthesizing frequencies. It doesn’t need to “understand” Arabic.
The tradeoff: flow-matching models need more data and more training steps (50K+ is common). The quality floor is lower at first because the model has to learn everything from scratch. But the ceiling is the highest of any architecture, because there’s no quantization bottleneck anywhere in the pipeline. If you’re building a voice agent pipeline where latency matters, the non-autoregressive inference is also a plus.
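For intuition, here's a minimal sketch of one training step under a rectified-flow style flow-matching objective, with a toy velocity network standing in for the real acoustic model. Shapes and hyperparameters are illustrative, not any specific model's recipe:

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy stand-in for the acoustic model: predicts a velocity field over mel
    frames given a noisy mel, a timestep, and text conditioning."""
    def __init__(self, n_mels=80, text_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels + text_dim + 1, 512), nn.SiLU(), nn.Linear(512, n_mels)
        )

    def forward(self, x_t, t, text_cond):
        t = t.expand(x_t.shape[0], x_t.shape[1], 1)   # broadcast timestep to every frame
        return self.net(torch.cat([x_t, text_cond, t], dim=-1))

model = VelocityNet()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

mel = torch.randn(4, 200, 80)          # target mel-spectrograms [batch, frames, mels]
text_cond = torch.randn(4, 200, 128)   # upsampled text/phoneme conditioning (stand-in)

# Rectified-flow objective: interpolate noise -> data, regress the constant velocity.
t = torch.rand(4, 1, 1)                # one random timestep per example
noise = torch.randn_like(mel)
x_t = (1 - t) * noise + t * mel        # a point on the straight noise-to-mel path
velocity_target = mel - noise          # the velocity that traces that path

loss = nn.functional.mse_loss(model(x_t, t, text_cond), velocity_target)
loss.backward()
opt.step()
```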
Hybrid Approaches (AR Semantic + Continuous Acoustic)
Two-stage designs where an LLM front-end predicts semantic or coarse speech tokens, and then a flow-matching back-end converts those into fine-grained mel-spectrograms. You get the LLM’s sense of rhythm and pacing on top, with the continuous model’s uncapped audio quality underneath.
The adaptability depends entirely on what the autoregressive stage outputs. If it uses learnable semantic tokens tied to the model (trainable during fine-tuning), the bottleneck is soft. If it uses a frozen audio codec for the AR stage, the same hard ceiling as pure codec models applies. Not all hybrids are equal. Check what’s frozen before committing.
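Since adaptability hinges on which components are frozen, it's worth checking before spending any compute. A small helper like the one below works on any PyTorch model; the module names you'll see depend entirely on the checkpoint you load:

```python
import torch.nn as nn

def report_frozen(model: nn.Module):
    """Print trainable vs. frozen parameter counts for each top-level submodule."""
    for name, module in model.named_children():
        total = sum(p.numel() for p in module.parameters())
        trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
        status = "trainable" if trainable else "FROZEN"
        print(f"{name:24s} {status:9s} {trainable:>14,d} / {total:,d} params")

# Usage (module names depend entirely on the checkpoint you load):
#   report_frozen(tts_model)
# If the audio tokenizer / codec shows up as FROZEN, that's where the ceiling is.
```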
Architecture Comparison at a Glance
| Property | Codec LM (Discrete) | Flow-Matching (Continuous) | Hybrid (AR + Continuous) |
|---|---|---|---|
| Audio representation | Discrete tokens from frozen codebook | Continuous mel-spectrograms | Semantic tokens + continuous mel |
| New language ceiling | Hard limit if phonemes missing | No limit (mel-space is universal) | Depends on AR stage design |
| Fine-tuning approach | LoRA on LLM (fast, familiar) | Full model or LoRA (needs more steps) | LoRA on both stages |
| Quality floor | High (codec ensures clean audio) | Low (must learn from scratch) | Medium (LLM provides structure) |
| Quality ceiling | Capped by codec | Highest (no bottleneck) | High if AR stage is learnable |
The Codec Bottleneck
This is the thing I wish someone had told me before I started my experiments. It would have saved me days.
How codecs work (briefly):
Most discrete TTS codecs use Residual Vector Quantization (RVQ). Audio goes through an encoder, and the continuous representation gets quantized into sequences of discrete codes from a learned codebook. Multiple codebook layers refine the residuals for higher fidelity: the first layer captures coarse structure, later layers add detail. The codebook is learned during codec training, then frozen permanently. (This is the part that comes back to bite you.)
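Here's a minimal sketch of that quantization step with random stand-in codebooks, just to make the "snap to the nearest entry" behavior concrete:

```python
import torch

def rvq_encode(frames, codebooks):
    """Quantize each frame against a stack of codebooks, layer by layer.

    frames:    [T, D] continuous encoder outputs
    codebooks: list of [K, D] tensors, learned during codec training and frozen
    """
    residual = frames
    codes = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)        # distance to every codebook entry
        idx = dists.argmin(dim=-1)               # snap to the NEAREST entry...
        codes.append(idx)                        # ...whatever the input actually was
        residual = residual - cb[idx]            # later layers refine what's left
    return torch.stack(codes)                    # [n_layers, T] discrete token IDs

# Toy example: 8 codebook layers of 1024 entries each over 128-dim frames.
codebooks = [torch.randn(1024, 128) for _ in range(8)]
frames = torch.randn(50, 128)                    # e.g. a pharyngeal the codec never saw
codes = rvq_encode(frames, codebooks)
# If no combination of entries reconstructs the target spectral pattern,
# the decoder can only ever produce an approximation of it.
```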
The representation constraint:
The codec’s codebook reflects its training data distribution. The dominant open-source codecs were trained on datasets that are overwhelmingly English and European. One popular codec was trained on 10,000 hours of English-only data. Ten thousand hours. Another was trained on 9,000+ hours in which Arabic accounted for roughly 70 hours, less than 1%. And people wonder why it can’t say ع.
Arabic pharyngeals and emphatics produce spectral patterns that look nothing like English. When such a codec encounters the sound ع, it does the only thing it can: snap it to the nearest available codebook entry, which is some non-Arabic approximation. The LLM sitting on top can learn to predict the “most Arabic-like” token IDs available, but if no token ID decodes to a proper ع, no amount of fine-tuning fixes that. The codec is the ceiling, and the ceiling is frozen.
Here’s what makes this insidious: the output will still sound clean. The frozen decoder was trained to produce pleasant-sounding audio regardless of what tokens it receives. So you get confident, smooth audio that is phonemically wrong. In my experiments, I heard exactly this. Perfectly clear audio that was recognizably “trying” to be Arabic but failing at every phoneme. Gibberish delivered with confidence.
The continuous alternative:
A mel-spectrogram is just a matrix of frequency magnitudes over time. Arabic ع has a perfectly valid spectral pattern, and it’s just floating-point numbers. The model can learn to produce any pattern through fine-tuning, and the vocoder then converts it to audio. Modern vocoders handle this with minimal information loss. A vocoder trained entirely on English can still faithfully reconstruct an Arabic mel, because it’s synthesizing frequencies, not interpreting language.
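Here's a minimal sketch of that representation using torchaudio; the settings are typical TTS front-end values, not tied to any particular model:

```python
import torch
import torchaudio

# Typical TTS front-end settings; values are illustrative, not model-specific.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,
    hop_length=256,
    n_mels=80,
)

waveform = torch.randn(1, 16000)           # one second of audio (stand-in for a real clip)
mel = mel_transform(waveform)              # [1, 80, frames]: frequency magnitudes over time
log_mel = torch.log(mel.clamp(min=1e-5))   # log compression, a common TTS training target

print(log_mel.shape)
# No vocabulary anywhere: an Arabic ع is just another pattern of floats,
# so fine-tuning can move the model toward it without hitting a wall.
```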
I keep coming back to this analogy: a frozen codec is like painting with 256 pre-mixed colors someone else chose. If your shade isn’t in the palette, you’re stuck. Continuous mel-space is like painting with unlimited colors, and then someone photographs your painting. The photo loses a tiny bit of detail, but faithfully reproduces whatever you painted, including colors the photographer has never seen.
Phonemes, Graphemes, and the Text Frontend
Before any audio generation happens, the model has to process the input text.
Some models train on phonemes (IPA transcriptions), others on raw graphemes (written characters). Phoneme-based training uses a grapheme-to-phoneme (G2P) system to convert text to IPA before feeding it to the model. The advantage: a shared phonetic representation that works across languages and dialects. The disadvantage: you need a good G2P for your target language. Bad G2P introduces systematic errors before the model ever sees a single training example.
Grapheme-based training feeds raw text directly. This works surprisingly well for languages with transparent orthography, where what you write is close to what you say.
The golden rule: bad G2P is worse than no G2P. A grapheme-to-phoneme model that makes systematic errors will corrupt your training data in ways you might not catch until it’s too late. If you don’t have a reliable G2P for your dialect, graphemes are the safer bet.
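As a concrete example of phoneme-based preprocessing, here's a sketch using the phonemizer package with its espeak-ng backend, which ships an Arabic voice. Whether that (MSA-leaning) output is good enough for your dialect is exactly the point above, so audit it before trusting it:

```python
# Requires the `phonemizer` package and an espeak-ng installation.
from phonemizer import phonemize

texts = [
    "السلام عليكم",     # MSA greeting
    "ازيك عامل ايه",    # Egyptian Arabic greeting
]

ipa = phonemize(
    texts,
    language="ar",      # espeak-ng's Arabic voice (MSA-leaning; audit for dialects)
    backend="espeak",
    strip=True,
)
for raw, phones in zip(texts, ipa):
    print(raw, "->", phones)
```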
For Arabic specifically, the orthography is largely phonemic (letters map closely to sounds), which makes grapheme-based training viable. But written Arabic typically omits diacritics (tashkeel), and if you’ve ever argued about whether كتب means “kataba” (he wrote), “kutub” (books), or “kutiba” (it was written), you’ve experienced the diacritics problem firsthand. Good Arabic diacritizers exist, but they’re still an active research area.
One approach worth knowing about: using IPA as a shared representation to bridge dialects. The same Arabic letter can sound quite different across regions. The letter ق is a voiceless uvular stop in MSA, a glottal stop in Egyptian Arabic, and a voiced velar stop in parts of the Levant. IPA makes these distinctions explicit rather than leaving them for the model to figure out from context.
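A toy illustration of what "making the distinction explicit" can look like: a dialect-aware rewrite pass over a baseline G2P output. The substitution table below is deliberately tiny, the entries beyond ق (ث, ذ) are common Egyptian tendencies rather than a complete rule set, and a real system needs word-level context and exception lists:

```python
# Toy dialect-aware rewrite over baseline (MSA-style) IPA output.
# Illustrative only; real dialect G2P needs context and exception lists.
EGYPTIAN_OVERRIDES = {
    "q": "ʔ",   # ق: uvular stop in MSA -> glottal stop in Egyptian Arabic
    "θ": "s",   # ث: often /s/ (or /t/) in Egyptian speech, depending on the word
    "ð": "z",   # ذ: often /z/ (or /d/), likewise word-dependent
}

def egyptianize(ipa_tokens):
    """Apply per-phoneme dialect overrides to a baseline IPA sequence."""
    return [EGYPTIAN_OVERRIDES.get(p, p) for p in ipa_tokens]

print(egyptianize(["q", "a", "l", "b"]))   # قلب: ['ʔ', 'a', 'l', 'b'] in Egyptian Arabic
```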
Data Pipeline at a Glance
I wrote about dataset curation in a previous post, and those lessons apply doubly here. The priority ordering for TTS data:
Transcript accuracy > Audio quality > Speaker diversity > Total hours.
10 hours of perfectly transcribed, clean audio from diverse speakers will outperform 100 hours of noisy, poorly transcribed recordings. For TTS, transcript errors don’t just hurt training. They teach the model to mispronounce words. The model will faithfully learn whatever mapping you give it, including the wrong ones.
Core requirements (a quick filtering sketch follows the list):
- Clean audio: 16kHz minimum, good signal-to-noise ratio, 3 to 15 second segments
- Accurate transcripts: Under 5% WER for early training stages, under 2% for voice adaptation. For Arabic, this means dialect-aware transcription, not MSA conventions applied to Egyptian speech
- Speaker diversity: Especially for dialect adaptation, where 50 speakers at 30 minutes each outperform 5 speakers at 10 hours each
- Forced alignment (optional but helpful): Frame-level phoneme-to-audio alignment improves duration modeling
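Here's the filtering sketch mentioned above: a minimal keep/drop check over one (audio, transcript) pair using torchaudio metadata. The WER value is assumed to come from your own verification pass (for example, a human transcript scored against a strong ASR system), and the thresholds mirror the list above:

```python
import torchaudio

MIN_SR = 16_000
MIN_SEC, MAX_SEC = 3.0, 15.0
MAX_WER = 0.05            # tighten toward 0.02 for the voice-adaptation stage

def keep_clip(path: str, transcript_wer: float) -> bool:
    """Keep/drop decision for one (audio, transcript) pair, per the list above.

    transcript_wer is assumed to come from your own verification pass, e.g. a
    human transcript scored against a strong ASR system."""
    info = torchaudio.info(path)
    duration = info.num_frames / info.sample_rate
    return (
        info.sample_rate >= MIN_SR
        and MIN_SEC <= duration <= MAX_SEC
        and transcript_wer <= MAX_WER
    )

# Usage sketch: dataset = [ex for ex in candidates if keep_clip(ex.path, ex.wer)]
```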
The same “garbage in, garbage out” rule applies, but TTS garbage is harder to catch. A bad ASR model outputs text you can read and immediately spot as wrong. A bad TTS model outputs audio you have to listen to carefully, and if you’re not a native speaker of the target language, you might not even notice the errors.
Multi-Stage Fine-Tuning for Language Adaptation
What I’ve seen work (and what the research backs up) is progressive fine-tuning in up to three stages. You don’t always need all three.
Stage 1: Standard Language Foundation (50-100 hours, multi-speaker)
Teach the model the phonological system, prosody patterns, and script of the broader language family. For Arabic, this means MSA data, building a general Arabic acoustic space in the model’s representations. The model learns what Arabic sounds like at a structural level: the rhythm, the phoneme inventory, the stress patterns.
When to skip: if your starting checkpoint already has target language exposure. No point teaching the model something it already knows.
Stage 2: Dialect Adaptation (20-50 hours, diverse speakers)
This is where the model learns to sound like the target dialect. Egyptian Arabic has distinct phoneme substitutions, prosody, and vocabulary that differ from MSA. The model needs many different speakers to learn the dialect’s characteristics rather than memorizing individual voices.
Speaker diversity matters more than total hours at this stage. 50 speakers at 30 minutes each beats 5 speakers at 10 hours each.
LoRA over full fine-tuning. This isn’t just a resource-saving trick. Research consistently shows that full fine-tuning on limited data degrades the base model’s generalization, while LoRA preserves it. You want to steer the model, not overwrite it.
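A minimal sketch of that LoRA setup with Hugging Face peft, assuming a transformer backbone. The base_model argument stands in for your Stage 1 checkpoint, and the target_modules names are a common default you should verify against your model's actual layer names:

```python
from peft import LoraConfig, get_peft_model

def add_lora_adapters(base_model):
    """Wrap the Stage 1 checkpoint (placeholder argument) with LoRA adapters
    instead of fully fine-tuning it, preserving the base model's generalization."""
    lora_config = LoraConfig(
        r=16,                                 # adapter rank: small means gentle steering
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # assumption: verify against your backbone
    )
    model = get_peft_model(base_model, lora_config)
    model.print_trainable_parameters()        # typically ~1% of weights end up trainable
    return model
```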
Stage 3: Single Speaker Voice Adaptation (2-5 hours)
Clone or perfect a specific target voice. The model already knows the language and dialect. Now it just needs to learn how one particular person sounds.
Do not exceed 5 hours at this stage. More data causes overfitting. The model memorizes specific utterances instead of learning generalizable voice characteristics. 2 to 5 hours of high-quality audio with good script coverage is the sweet spot.
When to skip stages: if your checkpoint already covers a stage, skip it. Starting from a model that already speaks MSA? Jump to Stage 2. Already have a dialect-adapted model? Go straight to Stage 3.
Zero-Shot as a Diagnostic Tool
Before investing compute in fine-tuning, there’s a free test you should always run: give the model your target language and see what comes out.
If a codec-based model can’t produce recognizable Arabic using its pretrained weights, the codec’s codebook doesn’t have Arabic patterns. Fine-tuning the LLM above it won’t help. This is a structural limitation, not a training one.
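A sketch of that five-minute probe, assuming a generic synthesize callable (text in, waveform tensor out); the interface is hypothetical, so adapt it to whatever checkpoint you're evaluating:

```python
import torchaudio

# Probe sentences; add whatever stresses your target phonemes (ع, ق, ط, ...).
TEST_SENTENCES = [
    "السلام عليكم ورحمة الله",      # MSA
    "ازيك؟ عامل ايه النهارده؟",      # Egyptian Arabic
]

def zero_shot_probe(synthesize, sample_rate=24_000, out_prefix="zeroshot"):
    """Run the pretrained model on target-language text and save the audio for
    a native speaker to judge, before any fine-tuning compute is spent."""
    for i, text in enumerate(TEST_SENTENCES):
        wav = synthesize(text)                       # assumed shape: [channels, samples]
        torchaudio.save(f"{out_prefix}_{i}.wav", wav, sample_rate)
        print(f"wrote {out_prefix}_{i}.wav")
```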
I learned this the hard way. I spent time fine-tuning a codec-based approach, watching loss curves improve, feeling optimistic. Then I actually listened to the output: clean delivery, confident pacing, completely wrong phonemes. The model sounded like it was trying to speak Arabic through a mouth that had only ever spoken English. A five-minute zero-shot test before fine-tuning would have told me immediately that the codec was the problem.
For continuous models, poor zero-shot results tell a completely different story. It means you need more data or more training steps, but the architecture itself isn’t blocking you. The path to improvement exists, and that distinction is everything.
Frequently Asked Questions
Can I fine-tune a TTS model to speak Arabic?
Yes, but your results depend almost entirely on the architecture. Continuous models (flow-matching) can learn Arabic phonemes through fine-tuning because mel-spectrograms can represent any sound. Codec-based models are limited by their frozen codebook, and if it was trained mostly on English, Arabic phonemes may have no valid representation.
Why does my TTS model produce gibberish in a new language?
The most common cause is a frozen codec that lacks codebook entries for your target language’s phonemes. The model generates clean-sounding audio because the decoder is well-trained, but the phonemes are wrong because the codec maps unfamiliar sounds to the nearest English approximation. Run a zero-shot test before fine-tuning to catch this early.
What’s better for multilingual TTS, phonemes or graphemes?
It depends on whether you have a reliable grapheme-to-phoneme (G2P) converter for your language. Phonemes provide a universal representation across dialects, but a bad G2P introduces systematic errors that corrupt your training data. If no high-quality G2P exists for your target dialect, graphemes are the safer choice.
How much data do I need to adapt TTS to a new language?
For a three-stage approach: 50-100 hours of multi-speaker data for the standard language foundation, 20-50 hours of diverse speakers for dialect adaptation, and 2-5 hours for single-speaker voice cloning. Quality matters far more than quantity – 10 hours of clean, accurately transcribed audio outperforms 100 hours of noisy data.
Takeaways
After weeks of research, failed experiments, and enough listening sessions to make my ears ring:
- Architecture > Data > Training recipe. A frozen-codec model with 1,000 hours of Arabic data will never match a continuous model with 50 hours if the codec can’t represent Arabic phonemes.
- For new languages: continuous > hybrid (with learnable semantics) > discrete codecs. The less discretization in the pipeline, the fewer hard ceilings you hit.
- “Sounds clean” and “sounds correct” are completely different things for underrepresented languages. Automated quality metrics trained mostly on English speech will mislead you. I saw automated MOS scores give 4.4/5 to audio that native speakers called terrible.
- Human listening by a native speaker is the only reliable evaluation. Automated metrics are useful for filtering obvious failures, but the final call has to come from someone who actually speaks the language.
- Check the codec first. If you’re working with a language that has phonemes outside the English inventory, everything else is downstream of that one choice. A five-minute zero-shot test can save you weeks.