Speech-to-Text (STT)
AI technology that converts spoken language into written text, enabling transcription, voice commands, and accessibility features.
Speech-to-text (STT), also called automatic speech recognition (ASR), is AI technology that converts spoken language into written text. Every time you dictate a message, use a voice assistant, or see live captions on a video call, speech-to-text is at work.
How modern STT works
Early speech recognition used acoustic models that matched sound patterns to phonemes and assembled them into words. Modern STT systems use end-to-end deep learning (typically transformer-based models) that directly convert audio waveforms into text.
The process:
- Audio is captured and converted into a spectrogram: a visual representation of sound frequencies over time.
- The spectrogram is processed by a neural network that has been trained on thousands of hours of transcribed speech.
- The model outputs a sequence of tokens that form the transcribed text.
- In some pipelines, language model post-processing then corrects likely errors and adds punctuation; other models emit punctuated text directly.
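The first step above, turning audio into a spectrogram, can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not how any particular STT product does it: the frame size, hop length, sample rate, and the synthetic 1 kHz tone are all assumed values chosen for the example, and production systems typically use an optimised FFT library and a log-mel frequency scale rather than the naive transform shown here.

```python
import cmath
import math


def hann_window(n):
    """Hann window of length n, used to taper each frame before the DFT."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft_magnitudes(frame):
    """Magnitudes of the first n/2 + 1 DFT bins of a real-valued frame (naive O(n^2))."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def spectrogram(samples, frame_size=64, hop=32):
    """Split samples into overlapping frames and return one magnitude spectrum per frame."""
    window = hann_window(frame_size)
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_size)]
        frames.append(dft_magnitudes(frame))
    return frames


# Synthetic input (assumed values): a 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
TONE_HZ = 1000
FRAME_SIZE = 64
samples = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(512)]

spec = spectrogram(samples, frame_size=FRAME_SIZE, hop=32)

# The strongest bin in the first frame should sit at the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(len(spec), "frames x", len(spec[0]), "frequency bins; peak near", peak_hz, "Hz")
```

Each row of `spec` is one time slice and each column one frequency band, which is the grid a neural network would consume in the next step.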
Key STT models
- Whisper (OpenAI): Open-source model with strong multilingual capabilities.
- Deepgram: Enterprise-focused with real-time transcription and customisation.
- Google Speech-to-Text: Cloud API with broad language support.
- Amazon Transcribe: AWS's managed transcription service.
- Azure Speech Service: Microsoft's offering with custom model training.
Factors affecting accuracy
- Audio quality: Clear audio in quiet environments produces the best results. Background noise, echo, and low-quality microphones degrade accuracy.
- Accents and dialects: Models perform better on accents well-represented in training data.
- Domain vocabulary: Technical, medical, or legal terminology may require custom models or vocabulary lists.
- Speaker overlap: Multiple speakers talking simultaneously challenges most systems.
- Language: High-resource languages (English, Mandarin, Spanish) are recognised more accurately than low-resource languages, which have less transcribed training data available.
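To compare how these factors affect a given system, accuracy is commonly quantified as word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the transcript into a reference, divided by the number of reference words. The sketch below is one simple way to compute it, assuming lowercase whitespace tokenisation; the function name and the example sentences are illustrative, and real evaluations usually apply more careful text normalisation first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative example: a domain term split into two ordinary words.
reference = "the patient has acute bronchitis"
hypothesis = "the patient has a cute bronchitis"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}")
```

Here the misrecognised medical term costs one substitution and one insertion against five reference words. Running the same measurement on samples of your own audio (noisy rooms, regional accents, specialist vocabulary) is a practical way to see which of the factors above matter most for your use case.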
Business applications
- Meeting transcription: Automatic notes and summaries from meetings and calls.
- Customer service: Transcribing and analysing support calls for quality and insights.
- Accessibility: Real-time captions for deaf or hard-of-hearing individuals.
- Content creation: Dictation for articles, emails, and documentation.
- Search: Making audio and video content searchable by transcribing it.
- Compliance: Creating audit trails of verbal communications in regulated industries.
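As an illustration of the search use case, the sketch below builds a small inverted index over timestamped transcript segments so that a keyword query returns the moments where those words were spoken. The segment format (start time in seconds plus text), the function names, and the sample data are all assumptions made for this example; it is not tied to any particular STT product's output.

```python
import re
from collections import defaultdict

WORD_PATTERN = re.compile(r"[a-z0-9']+")


def build_index(segments):
    """Map each lowercase word to the set of segment positions that contain it."""
    index = defaultdict(set)
    for seg_id, (_start_seconds, text) in enumerate(segments):
        for word in WORD_PATTERN.findall(text.lower()):
            index[word].add(seg_id)
    return index


def search(index, segments, query):
    """Return the segments that contain every word in the query, in time order."""
    words = WORD_PATTERN.findall(query.lower())
    if not words:
        return []
    matching_ids = set.intersection(*(index.get(w, set()) for w in words))
    return [segments[i] for i in sorted(matching_ids)]


# Hypothetical transcript segments: (start time in seconds, transcribed text).
segments = [
    (0.0, "Welcome to the quarterly review"),
    (12.5, "Revenue grew in the third quarter"),
    (47.0, "Next quarter we focus on revenue retention"),
]

index = build_index(segments)
for start, text in search(index, segments, "revenue"):
    print(f"{start:>6.1f}s  {text}")
```

Because each hit carries its start time, a player could jump straight to the relevant point in the recording. This version matches exact words only; stemming or fuzzy matching would be needed to treat "quarterly" and "quarter" as related.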
Why This Matters
Speech-to-text is one of the most immediately practical AI technologies. It saves hours of manual transcription, makes audio content searchable and analysable, and improves accessibility. Understanding STT capabilities helps you identify opportunities to unlock value from the spoken information flowing through your organisation daily.
This topic is covered in our lesson: Beyond Text: Images, Audio, and Video