How I picked a Nepali voice for kchakhabar.com (and cut the bill by 90%)


A few days ago I shipped vertical, AI-narrated video versions of every top story on kchakhabar.com. Hourly. Both languages. The first day it ran in production I checked the bill and thought: we cannot keep doing this.

The narration was running on ElevenLabs Multilingual v3. I had picked it for one reason and one reason only: it was the only TTS provider whose Nepali voice sounded like a real news anchor. I had tried half the internet before launch. ElevenLabs let me pick a Hindi-trained library voice that, by virtue of script and phoneme overlap, reads Nepali better than the dedicated Nepali options I’d tried elsewhere.

The cost was about 30 cents per minute of audio. 15 to 20 cents per 30-second clip. At 48 videos a day across both languages, that is roughly $7 a day. For a one-person hobby project with a tiny but lovely audience, it was the wrong size of bill.
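The arithmetic behind that daily figure, as a minimal sketch. The numbers come straight from the paragraph above; the constant and function names are only illustrative.

```python
CLIPS_PER_DAY = 48                 # hourly videos across both languages
COST_PER_CLIP_USD = (0.15, 0.20)   # low and high estimate for a 30-second clip


def daily_cost_usd(clips_per_day: int, cost_per_clip_usd: float) -> float:
    """Daily narration bill for a fixed number of clips at a flat per-clip price."""
    return clips_per_day * cost_per_clip_usd


low, high = (daily_cost_usd(CLIPS_PER_DAY, cost) for cost in COST_PER_CLIP_USD)
print(f"${low:.2f} to ${high:.2f} per day")
```

The "roughly $7 a day" figure sits at the low end of that range.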

So I did the boring thing. I ran a shootout.

Same script, four providers

I picked one real story from prod — Prime Minister Balendra Shah ordering political student wings out of universities — and ran the first sentence through every provider that plausibly handled Nepali. The text:

विश्वविद्यालयबाट दलीय विद्यार्थी संगठन हटाउन प्रधानमन्त्री शाहको निर्देशन

प्रधानमन्त्री बालेन्द्र शाहले काठमाडौंस्थित प्रधानमन्त्री तथा मन्त्रिपरिषद्को कार्यालयमा ११ विश्वविद्यालय र सात प्रतिष्ठानका उपकुलपतिहरूसँग साढे तीन घण्टा छलफल गरी शैक्षिक संस्थाहरूबाट दलीय विद्यार्थी तथा कर्मचारी संगठनका संरचना तत्काल हटाउन निर्देशन दिएका छन्। विश्वविद्यालयका कुलपति समेत रहेका प्रधानमन्त्री शाहले कुनै पनि बहानामा शैक्षिक संस्थामा राजनीति गर्न नहुने अडान लिए। सरकारको सुशासन कार्ययोजनाअनुसार ६० दिनभित्र दलीय संगठन हटाउने र ९० दिनभित्र विद्यार्थी काउन्सिल संयन्त्र विकास गर्ने लक्ष्य राखिएको छ।

[Audio: Nepali news script · same text, four providers]

And the bilingual version — English narration with Nepali proper nouns dropped in. Listen for Bagmati, Singha Durbar, Swornim Wagle, fiscal year 2083-84:

[Audio: English script with Nepali proper nouns · code-mixing test]

1. ElevenLabs eleven_v3 — the incumbent

The voice I had launched with. Excellent Nepali prosody. Native word timestamps. ~$0.30/min. Produces audio that I have, to date, not had a single complaint about.

2. Azure Neural TTS — the big-cloud surprise

Microsoft quietly added Nepali support to Azure Speech in late 2024: ne-NP-HemkalaNeural (female) and ne-NP-SagarNeural (male). Both are full neural-quality voices. Pricing: $15 per 1M characters for Standard, $22 for Neural HD (recently dropped from $30). That works out to about $0.04/min.

The voice quality is good, but a step below ElevenLabs for news content.

3. Deepgram Aura-2 — the cheapest. The wrongest.

On paper Aura-2 was the most exciting option. $0.030 per 1,000 characters. Sub-200ms latency. Real-time-friendly. I went looking for the Nepali voice.

There isn’t one. Aura-2 supports seven languages: English, Spanish, Dutch, French, German, Italian, Japanese. That is the full list. No Nepali. No Hindi. No anything from the subcontinent.

For documentation purposes I rendered the Nepali script through Aura-2’s English voice (Thalia) anyway. The result is genuinely instructive: a confident American newsreader mispronouncing every Devanagari syllable. Vishwavidyalayabata daliya… delivered as though it were Latin. Useful proof of why “supports the language” has to be the first filter, not the last.

4. Gemini 3.1 Flash TTS — the landing

gemini-3.1-flash-tts-preview is a multi-speaker TTS model with 30 prebuilt voices. It auto-detects the input language across 24+ languages, including Nepali. Voice quality on Nepali: comparable to ElevenLabs to my ear, with a few quirks. Default voice: Aoede.

The pricing math is fun. Text input is $0.50 per 1M tokens, audio output is $20 per 1M tokens, and audio comes out at roughly 25 tokens per second. So 60 seconds of audio = 1500 tokens = 3 cents. About $0.03/min. Comparable to Deepgram on price, comparable to ElevenLabs on quality, and unlike Deepgram it actually speaks the language.
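The same math as a small sketch, in case you want to plug in your own clip lengths. The rates are the ones quoted above; the 25 tokens per second figure is approximate, and the function name is illustrative. Text-input cost is left out because it is small next to the audio output for a short script.

```python
AUDIO_TOKENS_PER_SECOND = 25           # approximate rate quoted above
AUDIO_USD_PER_MILLION_TOKENS = 20.0    # audio output price quoted above


def audio_output_cost_usd(seconds: float) -> float:
    """Estimated audio-output cost for a clip of the given length."""
    tokens = seconds * AUDIO_TOKENS_PER_SECOND
    return tokens * AUDIO_USD_PER_MILLION_TOKENS / 1_000_000


per_minute = audio_output_cost_usd(60)   # 1,500 tokens
per_clip = audio_output_cost_usd(30)     # 750 tokens
print(f"${per_minute:.3f} per minute, ${per_clip:.3f} per 30-second clip")
```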

Production switched to Gemini. The narration bill went from ~$7/day to under $0.70/day. That is a ~10× saving, and the voice quality survived intact.

The one thing I gave up

ElevenLabs and Azure both return word-level timestamps. You hand them a sentence; they hand back audio plus an array like [{word: "नमस्ते", start: 0.0, end: 0.42}, ...]. From that you can render perfectly aligned captions, animate each word as it is spoken, and sync visuals to specific phrases. It is a feature you don’t realise you depend on until you don’t have it.
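To make the dependency concrete, here is a minimal sketch of turning such an array into caption cues. It assumes the simplified {word, start, end} shape shown above, not either provider's actual response schema; the function name, the sample timings after the first word, and the `max_words` grouping are all hypothetical.

```python
from typing import TypedDict


class WordTiming(TypedDict):
    word: str
    start: float  # seconds
    end: float    # seconds


def group_into_cues(words: list[WordTiming], max_words: int = 4) -> list[dict]:
    """Group consecutive words into caption cues of at most `max_words` words.

    Each cue starts when its first word starts and ends when its last word ends.
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        cues.append({
            "text": " ".join(w["word"] for w in chunk),
            "start": chunk[0]["start"],
            "end": chunk[-1]["end"],
        })
    return cues


# Illustrative timings only; real values would come from the provider.
sample: list[WordTiming] = [
    {"word": "नमस्ते", "start": 0.0, "end": 0.42},
    {"word": "K", "start": 0.50, "end": 0.62},
    {"word": "cha", "start": 0.62, "end": 0.85},
    {"word": "khabar", "start": 0.85, "end": 1.30},
    {"word": "मा", "start": 1.35, "end": 1.50},
]
cues = group_into_cues(sample, max_words=2)
```

With real timestamps, each cue's start and end can drive a caption track directly, which is what word-level highlighting relies on.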

Gemini doesn’t return alignment. The TTS API gives you audio bytes and a content type. You can estimate timings from total duration divided by word count, which is “good enough” for non-interactive narration but visibly off if your captions try to highlight individual words as they’re spoken.
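The duration-divided-by-word-count estimate can be sketched like this. It gives every word an equal slice of the clip, which is exactly why it drifts on long or short words; the function name and output shape are illustrative, and whitespace splitting is an assumption about how words are counted.

```python
def estimate_word_timings(text: str, total_duration: float) -> list[dict]:
    """Spread words evenly across the clip: each gets total_duration / word_count."""
    words = text.split()
    if not words:
        return []
    per_word = total_duration / len(words)
    return [
        {"word": word, "start": i * per_word, "end": (i + 1) * per_word}
        for i, word in enumerate(words)
    ]


# A four-word phrase in a two-second clip: each word gets half a second.
timings = estimate_word_timings("प्रधानमन्त्री शाहको निर्देशन आज", 2.0)
```

A slightly better estimate would weight each word by its character count, but it is still a guess rather than alignment.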

For kchakhabar’s vertical-video format we live with it. Captions render paragraph-by-paragraph rather than word-by-word. Nobody has complained. But it is the one feature ElevenLabs gave for free that I had to give up.

Three flavors of expressivity (a documentation aside)

Vidgen on kchakhabar wants neutral newsroom delivery, no styled excitement, no whispered asides. But all three of the in-contention providers support performative TTS, and each exposes it in a wildly different way. It is worth documenting because most TTS comparison articles on the internet skip this, and the design choices are the most interesting thing happening in voice synthesis right now.

ElevenLabs eleven_v3 uses inline bracket tags inside the text:

[excited] Big news today, friends!
[pause] [whispers] But here's a secret only insiders know...
[laughs] Just kidding —
[smiles] welcome to K Cha Khabar.

There is a published vocabulary: emotional ([excited], [sad], [curious]), performative ([whispers], [laughs], [sighs]), pacing ([pause], [break]). Each tag is a directive applied to the following clause.

Azure Neural TTS uses SSML with multi-style voices:

<mstts:express-as style="excited" styledegree="2">
  Big news today, friends!
</mstts:express-as>
<break time="600ms"/>
<mstts:express-as style="whispering">
  But here's a secret only insiders know...
</mstts:express-as>
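The fragment above needs a surrounding speak and voice element before a synthesizer will accept it. A minimal sketch of building the full document in Python follows, with text escaping handled by the standard library. The voice name is an assumption for illustration (not every voice supports every style, so check the voice's style list), and `build_ssml` is a hypothetical helper, not part of any SDK.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"


def build_ssml(voice: str, segments: list[tuple[str, str, int]], lang: str = "en-US") -> str:
    """Build an SSML document from (style, text, pause_ms_after) segments."""
    parts = []
    for style, text, pause_ms in segments:
        parts.append(
            f"<mstts:express-as style={quoteattr(style)}>{escape(text)}</mstts:express-as>"
        )
        if pause_ms > 0:
            parts.append(f'<break time="{int(pause_ms)}ms"/>')
    body = "".join(parts)
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xmlns:mstts="{MSTTS_NS}" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{body}</voice></speak>"
    )


ssml = build_ssml(
    "en-US-AriaNeural",  # assumed voice name, for illustration only
    [
        ("excited", "Big news today, friends!", 600),
        ("whispering", "But here's a secret only insiders know...", 0),
    ],
)
root = ET.fromstring(ssml)  # raises ParseError if the markup is malformed
```

Parsing the result before sending it catches unescaped characters in story text, which is an easy mistake with headlines that contain ampersands or angle brackets.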

Gemini 3.1 Flash TTS uses a natural-language style directive prepended to the text:

Read the following with shifting emotion. Start with bright excitement,
drop to a conspiratorial whisper, laugh playfully, then smile your way
through the close:

"Big news today, friends! ..."

No fixed vocabulary. You describe the performance in prose; the model follows. Less deterministic than ElevenLabs’ tags, but more flexible — anything you can describe, you can ask for. “Read this in the voice of a tired Kathmandu uncle who has explained this point twice already” is an example that works well.

Same script. Three completely different rendering paradigms. All three sound expressive. None of them is in K cha khabar’s hot path (newsroom voice is neutral by design) but the future of TTS is performative.

Here is what each one actually does with it — same emotional arc, three syntaxes, three results:

[Audio: Performative TTS · excited → whisper → laugh → smile]

What it costs now

| Provider | Voice | Per minute | Quality on Nepali | Captions |
| --- | --- | --- | --- | --- |
| ElevenLabs eleven_v3 | Madhusmita (Hindi library voice) | $0.30 – $0.40 | excellent | word-level, free |
| Azure Neural | ne-NP-HemkalaNeural | ~$0.04 | good | word-boundary events |
| Deepgram Aura-2 | | ~$0.03 | n/a (no Nepali) | n/a |
| Gemini 3.1 Flash TTS | Aoede | ~$0.03 | excellent | duration-estimated |

K cha khabar runs on row 4 today. Narration bill: down 90%. Voice quality: kept. Captions: rougher, but livable.
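The "language support first, price second" ordering from the shootout can be written down as a tiny filter-then-sort. This is a sketch, not how the decision was actually automated: the prices are the per-minute figures from the comparison (low end for ElevenLabs), the `handles_nepali` flag reflects what I heard rather than any vendor claim, and the names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    usd_per_minute: float
    handles_nepali: bool


PROVIDERS = [
    Provider("ElevenLabs eleven_v3", 0.30, True),
    Provider("Azure Neural", 0.04, True),
    Provider("Deepgram Aura-2", 0.03, False),
    Provider("Gemini 3.1 Flash TTS", 0.03, True),
]


def shortlist(providers: list[Provider]) -> list[Provider]:
    """Drop providers that cannot speak the language, then sort the rest by price."""
    usable = [p for p in providers if p.handles_nepali]
    return sorted(usable, key=lambda p: p.usd_per_minute)


ranked = shortlist(PROVIDERS)
```

Price only ranks the survivors; the listening test still decides between them.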

The lesson, the same one as last time

The previous post on this site made a similar argument about the LLM that does the cluster summaries: the cheapest model that’s good enough is the one that ships. The TTS layer taught me the same lesson, in the same project, two weeks later. The benchmark, even an informal one with five samples and a critical ear, is the part that turns “good enough” from a feeling into a number.

If you are building a Nepali product and need a voice, the short version:

  • Best-in-class, money no object → ElevenLabs with a custom clone.
  • Cheapest with native Nepali quality → Gemini 3.1 Flash TTS, voice Aoede.
  • Skip → Deepgram Aura-2 (no Nepali).

Production runs on Gemini. The captions don’t always hit the word, but they hit the point. The voice ships.
