Practical

Speech-to-Text (STT)

Last reviewed: April 2026

AI technology that converts spoken language into written text, enabling transcription, voice commands, and accessibility features.

Speech-to-text (STT), also called automatic speech recognition (ASR), is AI technology that converts spoken language into written text. Every time you dictate a message, use a voice assistant, or see live captions on a video call, speech-to-text is at work.

How modern STT works

Early speech recognition systems chained separate components: an acoustic model that matched sound patterns to phonemes, a pronunciation dictionary, and a language model that helped assemble the phonemes into likely words. Modern STT systems use end-to-end deep learning, typically transformer-based models, in which a single network maps audio directly to text.

The process:

Audio is captured and converted into a spectrogram, a representation of how much energy the sound carries at each frequency over time.
The spectrogram is processed by a neural network trained on large volumes of transcribed speech, from thousands of hours up to hundreds of thousands for the largest models.
The model outputs a sequence of tokens that form the transcribed text.
Post-processing, often with a language model, corrects likely errors and adds punctuation where the recognition model has not already produced it.
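
The first step above can be sketched in a few lines of Python. This is a minimal illustration only, not how any particular STT product computes its features: the frame size, hop length, sample rate, and the synthetic 1,000 Hz test tone are all assumptions chosen for readability, and production systems typically use an optimised FFT and a log-mel scale rather than the naive transform shown here.

```python
import cmath
import math

def frame_signal(samples, frame_size, hop):
    """Split samples into overlapping frames of frame_size, stepping by hop."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, hop)]

def hann(n, size):
    """Hann window weight for position n in a frame of the given size."""
    return 0.5 - 0.5 * math.cos(2 * math.pi * n / (size - 1))

def magnitude_spectrum(frame):
    """Naive DFT magnitudes (bins 0..N/2) of one Hann-windowed frame."""
    n = len(frame)
    windowed = [x * hann(i, n) for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]

def spectrogram(samples, frame_size=64, hop=32):
    """One magnitude spectrum per frame: rows are time, columns are frequency."""
    return [magnitude_spectrum(f) for f in frame_signal(samples, frame_size, hop)]

# Assumed test input: a pure 1,000 Hz tone sampled at 8,000 Hz.
SAMPLE_RATE = 8000
FRAME_SIZE = 64
TONE_HZ = 1000
samples = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(256)]

spec = spectrogram(samples, frame_size=FRAME_SIZE, hop=32)

# The strongest frequency bin in each frame, converted back to hertz.
peak_bins = [max(range(len(row)), key=lambda k: row[k]) for row in spec]
peak_hz = peak_bins[0] * SAMPLE_RATE / FRAME_SIZE
print(len(spec), "frames; strongest frequency in first frame:", peak_hz, "Hz")
```

Each row of the result describes one short slice of time, and each column one frequency band. A grid of this kind is what the neural network in the second step receives in place of the raw waveform.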

Key STT models and services

Whisper (OpenAI): Open-source model with strong multilingual capabilities.
Deepgram: Enterprise-focused with real-time transcription and customisation.
Google Speech-to-Text: Cloud API with broad language support.
Amazon Transcribe: Amazon's managed transcription service on AWS.
Azure Speech Service: Microsoft's offering with custom model training.

Factors affecting accuracy

Audio quality: Clear audio in quiet environments produces the best results. Background noise, echo, and low-quality microphones degrade accuracy.
Accents and dialects: Models perform better on accents well-represented in training data.
Domain vocabulary: Technical, medical, or legal terminology may require custom models or vocabulary lists.
Speaker overlap: Most systems struggle when several people talk at the same time.
Language: High-resource languages with abundant training data (such as English, Mandarin, and Spanish) are recognised more accurately than low-resource languages.
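
Accuracy is commonly reported as word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the system's output into a human reference transcript, divided by the number of words in the reference. The sketch below is one simple way to compute it, assuming lowercase whitespace tokenisation; the function name and the example sentences are illustrative, and real evaluations usually normalise punctuation and numbers first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# Illustrative case: a technical term the model has not learned.
reference = "restart the kubernetes cluster before the next deploy"
hypothesis = "restart the cuban eighties cluster before the next deploy"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}")
```

Here one misheard technical term costs two errors (a substitution and an insertion) against eight reference words, which is the kind of gap that custom vocabulary lists aim to close.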

Business applications

Meeting transcription: Automatic notes and summaries from meetings and calls.
Customer service: Transcribing and analysing support calls for quality and insights.
Accessibility: Real-time captions for deaf or hard-of-hearing individuals.
Content creation: Dictation for articles, emails, and documentation.
Search: Making audio and video content searchable by transcribing it.
Compliance: Creating audit trails of verbal communications in regulated industries.
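
To illustrate the search use case: once a recording has been transcribed into timestamped segments, finding a spoken phrase becomes an ordinary text lookup. The sketch below assumes the STT system returns (start time in seconds, text) pairs, which many do in some form; the segment data, function names, and tokenisation rule are hypothetical.

```python
import re
from collections import defaultdict

WORD_PATTERN = re.compile(r"[a-z0-9']+")

def build_index(segments):
    """Map each lowercase word to the start times of the segments containing it."""
    index = defaultdict(list)
    for start_seconds, text in segments:
        for word in set(WORD_PATTERN.findall(text.lower())):
            index[word].append(start_seconds)
    return index

def search(index, query):
    """Return sorted start times of segments that contain every word in the query."""
    words = WORD_PATTERN.findall(query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)

# Hypothetical transcript segments: (start time in seconds, transcribed text).
segments = [
    (0.0, "Welcome everyone, let's review the quarterly budget."),
    (12.5, "The budget for marketing increased this quarter."),
    (27.0, "Next, the hiring plan for engineering."),
]
index = build_index(segments)

print(search(index, "budget"))            # segments mentioning "budget"
print(search(index, "marketing budget"))  # segments mentioning both words
```

The returned start times can be used to jump straight to the matching point in the audio or video.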

Why This Matters

Speech-to-text is one of the most immediately practical AI technologies. It saves hours of manual transcription, makes audio content searchable and analysable, and improves accessibility. Understanding STT capabilities helps you identify opportunities to unlock value from the spoken information flowing through your organisation daily.

Related Terms

Natural Language Processing (NLP)

The branch of AI focused on enabling computers to understand, interpret, and generate human language in useful ways.

Text-to-Speech (TTS)

AI technology that converts written text into natural-sounding spoken audio, enabling voice interfaces, audiobooks, and accessibility features.

Multimodal AI

AI systems that can process and generate multiple types of content — text, images, audio, video — rather than just text alone.

Real-Time AI

AI systems that process input and produce output fast enough to support live interactions — voice conversations, live video analysis, or instant recommendations.
