Most people know Whisper as a transcription model. Speak English, get English text. That's how almost every Whisper-based app uses it.
But translation was part of Whisper's design from the start. When OpenAI trained it on 680,000 hours of audio, they included multilingual audio paired with English transcriptions specifically to enable cross-lingual translation. Same-language transcription is only one of the tasks the model was built to do.
That design decision is why Tellaflow can offer 65-language live translation without a separate translation API, and why it all works offline.
Whisper's two modes
Whisper has two inference modes:
- Transcribe mode: Output text in the same language as the input audio. Speak French, get French text.
- Translate mode: Output English text regardless of the input language. Speak French, get English text.
Cloud voice tools typically expose transcribe mode. Translation is sold as a premium add-on, usually routed through a separate API (Google Translate, DeepL, etc.). That adds latency, cost, and another service that processes your content.
Tellaflow uses Whisper's built-in translate mode. No external API. Same model, different inference flag. The translation quality comes from the Whisper model itself, not from post-processing.
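As a rough illustration of what "same model, different inference flag" means, here is a minimal sketch using the open-source openai-whisper Python package. It is not Tellaflow's code: the model size, the file name in the usage comment, and the helper names decode_options and run_whisper are all assumptions made for the example.

```python
from typing import Optional


def decode_options(mode: str, language: Optional[str] = None) -> dict:
    """Build the keyword arguments that select Whisper's task.

    mode is "transcribe" (text in the spoken language) or
    "translate" (English text regardless of the spoken language).
    """
    if mode not in ("transcribe", "translate"):
        raise ValueError(f"unknown mode: {mode!r}")
    options = {"task": mode}
    if language is not None:
        # Optional hint; when omitted, Whisper detects the language itself.
        options["language"] = language
    return options


def run_whisper(audio_path: str, mode: str, model_name: str = "base") -> str:
    """Run one audio file through Whisper in the chosen mode.

    Requires the openai-whisper package; it is imported lazily so that
    decode_options() can be used and tested without it.
    """
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, **decode_options(mode))
    return result["text"]


# Hypothetical usage with a recording of someone speaking French:
#   run_whisper("french_note.wav", "transcribe")  -> French text
#   run_whisper("french_note.wav", "translate")   -> English text
```

The only difference between the two calls is the task value, which is why no second model or external service is needed.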
The pipeline
Here's what happens when you speak in, say, Hindi with live translation enabled:
1. Audio capture: Tellaflow records your speech in 30-second sliding windows (the input length Whisper was trained on) using macOS's Core Audio APIs.
2. Voice activity detection: We detect speech endpoints (when you pause or stop speaking) to decide when to send audio to the model.
3. Whisper inference (translate mode): The audio buffer is passed to Whisper with the task=translate flag. The model returns English text regardless of the input language.
4. Text injection: We use macOS Accessibility APIs to inject the translated text into whatever application has focus, the same way a keyboard input event would.
Steps 1–4 happen in under a second for most speech segments on Apple Silicon Macs.
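To make the endpointing and windowing steps more concrete, here is a simplified sketch. It assumes a plain energy threshold for voice activity detection, which is a stand-in technique and not necessarily what Tellaflow uses. The frame size, threshold, and silence duration are illustrative guesses, and the function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence

SAMPLE_RATE = 16_000   # Whisper models expect 16 kHz mono audio
WINDOW_SECONDS = 30    # length of one Whisper input window
FRAME_SAMPLES = 320    # 20 ms analysis frames (an assumption)


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def find_endpoint(samples: Sequence[float], threshold: float = 0.01,
                  min_silence_frames: int = 25) -> Optional[int]:
    """Return the sample index where speech ended, or None if still speaking.

    An endpoint is declared once speech has been heard and is followed by
    min_silence_frames consecutive quiet frames (25 x 20 ms = 0.5 s).
    """
    heard_speech = False
    quiet_run = 0
    for start in range(0, len(samples) - FRAME_SAMPLES + 1, FRAME_SAMPLES):
        frame = samples[start:start + FRAME_SAMPLES]
        if rms(frame) >= threshold:
            heard_speech = True
            quiet_run = 0
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= min_silence_frames:
                # Index of the first sample of the silent run.
                return start - (quiet_run - 1) * FRAME_SAMPLES
    return None


def clip_to_window(samples: Sequence[float]) -> List[float]:
    """Keep at most the most recent 30 seconds of audio."""
    max_samples = SAMPLE_RATE * WINDOW_SECONDS
    return list(samples[-max_samples:])


def next_segment(buffer: Sequence[float]) -> Optional[List[float]]:
    """Return the audio to send to the model, or None if speech is ongoing."""
    end = find_endpoint(buffer)
    if end is None:
        return None
    return clip_to_window(buffer[:end])
```

In a real app the returned segment would be handed to Whisper in translate mode, and the resulting text passed on to the text-injection step. A production detector would also need to cope with background noise, which a fixed energy threshold handles poorly.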
Why 65 languages and not more
Whisper's training data included audio in 99 languages. However, Whisper's own documentation acknowledges that performance varies significantly by language, largely depending on how much audio was available for training.
We advertise 65 languages because those are the languages where Whisper's word error rate is under a threshold we consider practically useful. The other 34 languages technically "work" but with enough errors that we didn't want to ship them as a feature.
As Whisper models improve and as we test more, this number will go up.
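For readers unfamiliar with the metric, word error rate is the number of word substitutions, deletions, and insertions needed to turn the model's output into a reference text, divided by the number of words in the reference. The sketch below shows one way to compute it and to apply a cutoff. The cutoff value and the per-language scores are inputs you would supply yourself, since the actual threshold is not stated here, and the function names are hypothetical.

```python
from typing import Dict, List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, computed one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


def languages_to_ship(wer_by_language: Dict[str, float],
                      max_wer: float) -> List[str]:
    """Keep only the languages whose measured WER is under the cutoff."""
    return sorted(
        lang for lang, wer in wer_by_language.items() if wer < max_wer
    )
```

For example, comparing the reference "the cat sat on mat" with the output "the cat sit on" gives one substitution and one deletion over five reference words, a WER of 0.4.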
The accuracy question
Translation quality from Whisper is good but not perfect. It's better than Google Translate for conversational speech and worse than a professional human translator for nuanced technical content. For most dictation use cases (notes, emails, drafts), it's easily good enough.
The main limitation: idiomatic expressions and culturally specific references sometimes translate awkwardly. Proper nouns occasionally get phonetically transcribed rather than translated. These are limitations of the current model, not of the architecture, and they should ease as the models improve.
What this means for users
If you speak a language other than English but work in English (a common situation for developers, researchers, and professionals at global companies), you can think and speak in your most comfortable language and have English text appear in your documents. Entirely offline.
That's genuinely useful. And it exists because OpenAI trained translation capability into Whisper from the beginning. It was just waiting for someone to expose it without putting a paywall in front of it.