In 1876, Alexander Graham Bell transmitted the first intelligible voice over a wire. In 1952, Bell Labs built "Audrey," the first speech recognition machine, capable of recognizing spoken digits. In 1990, Dragon Dictate shipped as the first commercially available voice recognition product, at $9,000 per copy.
In 2025, most knowledge workers still type everything.
That's a weird fact. The technology has existed for decades. The accuracy has been good enough for most use cases for at least ten years. And yet keyboards remain the dominant input method for written communication. Why?
The trust problem
The answer isn't technical capability; Dragon Dictate was already usable in 1990. The answer is that every voice tool since has required you to trust someone.
When you type, you trust your keyboard (reasonable), your OS (reasonable), and the app you're typing into (reasonable). That's it.
When you use a cloud voice tool, you trust all of those plus: the voice tool vendor, their server infrastructure, their privacy policy, their API providers, their security practices, and whatever they might do with your data if their business model needs to change.
That's a much longer chain of trust. For casual use, like asking Siri the weather, it's fine. For dictating sensitive documents, legal notes, or confidential strategy, it's a real risk that reasonable people reasonably decline to take on.
The cloud made it worse before making it better
Cloud computing enabled dramatically better speech recognition. Google, Amazon, and Microsoft all deployed massive models trained on billions of utterances. The accuracy improved significantly. But the tradeoff was that your audio had to go to their servers to benefit.
This created an awkward situation: to get good voice recognition, you had to accept cloud processing. For many use cases, that was a deal-breaker.
What changed in 2022 was Whisper. OpenAI released a model good enough to replace cloud services, available for anyone to download and run locally. And the hardware was already in place: in 2020, Apple had shipped the first M1 chips, with Neural Engines fast enough to run that model at useful speeds on a laptop.
The trust problem has been solved. It just hasn't been packaged into tools that make the tradeoff obvious.
Why voice is 3× faster (and why that matters)
The average person speaks at 130–150 words per minute. The average person types at 40–60 words per minute.
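That's roughly a 2.5–3× throughput difference. For a knowledge worker who spends 2 hours a day typing, that's 730 hours a year at the keyboard. Even at 2.5×, speaking the same words would take 292 hours. The other 438 hours a year are lost to input speed: about eleven 40-hour workweeks.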
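The 438-hour figure is easy to sanity-check. It is consistent with 2 hours of typing a day over all 365 days of the year and the conservative 2.5× end of the range. Here is a minimal sketch of that arithmetic; the function name and its parameters are illustrative, and the 365-day count is an assumption inferred from the number itself.

```python
# Back-of-the-envelope check of the yearly time-savings figure.
# Assumptions (inferred, not official): typing happens 365 days a year,
# and speaking is 2.5x faster than typing for the same number of words.

def hours_saved_per_year(typing_hours_per_day: float,
                         speedup: float,
                         days_per_year: int = 365) -> float:
    """Hours per year freed if the same words are spoken instead of typed."""
    typing_hours = typing_hours_per_day * days_per_year  # time spent typing
    speaking_hours = typing_hours / speedup              # same output, spoken
    return typing_hours - speaking_hours

saved = hours_saved_per_year(2, 2.5)
workweeks = saved / 40

print(f"{saved:.0f} hours/year, about {workweeks:.0f} 40-hour workweeks")
# prints: 438 hours/year, about 11 40-hour workweeks
```

The result is sensitive to the day count. Passing days_per_year=250 (working days only) gives 300 hours, or seven and a half workweeks, which is still a lot of time.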
But throughput isn't the only factor. There's something more subtle: the cognitive friction of typing. When you speak, you think and output simultaneously. When you type, there's a layer of motor translation between your thought and the screen. For most people, that layer has a measurable effect on the quality of the output, not just the speed.
Writers report this. Programmers report it with code comments. Executives report it with emails. The thinking doesn't always survive the bottleneck intact.
What the adoption curve actually looks like
Voice input adoption has historically followed a predictable pattern: people try it, find the accuracy disappointing or the privacy concerning, and go back to typing. The technology improves. People try again.
We're at the point in that curve where accuracy is solved (Whisper) and privacy is solvable (on-device inference). The remaining friction is habit: people are used to keyboards and haven't had a compelling reason to change.
The compelling reason is 3× faster output with no privacy tradeoff and no monthly bill. We think that's enough. The keyboard got its 150 years. It's had a good run.