
Phonemes, Graphemes, and Subword Units Explained

ASR systems don’t predict words directly from audio. Instead, they use phonemes, graphemes, or subword units as intermediate representations. This guide explains how these units work and how they affect accuracy, latency, and real-time ASR design.

Mohammed

17 Jan 2026

How Speech Recognition Models Represent Language

Automatic Speech Recognition (ASR) systems do not directly output “words” from audio. Instead, they predict intermediate linguistic units that are later decoded into text.

This document explains phonemes, graphemes, and subword units, why they exist, and how choosing between them impacts accuracy, latency, scalability, and real-time performance.

Overview

In ASR, one fundamental design decision is:

What linguistic unit should the model predict from audio?

The three most common choices are:

Phonemes (speech sounds)
Graphemes (written characters)
Subword units (fragments of words)

Each choice shapes:

Model architecture
Training data requirements
Error behavior
Real-time latency
Domain adaptability

Why Output Units Matter in ASR

Audio is continuous, but text is discrete. ASR models must bridge this gap by predicting a sequence of symbols.

The symbol type determines:

How pronunciation is handled
How well rare words are recognized
How easily the system scales across languages
How complex decoding becomes

There is no universally best choice, only trade-offs.

Phonemes

A phoneme is the smallest unit of sound that can distinguish one word from another in a language.

Examples:

bat vs pat → /b/ vs /p/
brioche → /b r i oʊ ʃ/
burrito → /b ə r i t oʊ/

Phonemes represent how speech sounds, not how it is spelled.

Why Use Phonemes in ASR?

Advantages

Robust to accents and pronunciation variation
Better handling of unseen or rare words
Strong performance in noisy environments

Disadvantages

Requires pronunciation dictionaries (lexicons)
Language-specific
More complex decoding pipeline

Phoneme-based systems are common in high-accuracy, domain-specific ASR.
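To illustrate why a phoneme-based pipeline depends on a lexicon, here is a minimal sketch. It assumes a toy pronunciation dictionary (`LEXICON`, with hypothetical entries and simplified phoneme symbols) and a greedy longest-match lookup. Production decoders typically search over many hypotheses with a language model rather than matching greedily, so treat this as an illustration only.

```python
# Toy pronunciation lexicon: word -> phoneme sequence.
# Entries and phoneme symbols are hypothetical and simplified.
LEXICON = {
    "bat": ("b", "ae", "t"),
    "pat": ("p", "ae", "t"),
    "cat": ("k", "ae", "t"),
    "a": ("ah",),
}

# Invert the lexicon so phoneme sequences can be looked up directly.
PHONES_TO_WORD = {phones: word for word, phones in LEXICON.items()}
MAX_LEN = max(len(phones) for phones in PHONES_TO_WORD)


def phonemes_to_words(phonemes):
    """Greedy longest-match segmentation of a phoneme sequence into lexicon words."""
    words, i = [], 0
    while i < len(phonemes):
        # Try the longest possible span first, then shrink it.
        for n in range(min(MAX_LEN, len(phonemes) - i), 0, -1):
            candidate = tuple(phonemes[i:i + n])
            if candidate in PHONES_TO_WORD:
                words.append(PHONES_TO_WORD[candidate])
                i += n
                break
        else:
            # No lexicon entry starts here: emit an unknown marker and move on.
            words.append("<unk>")
            i += 1
    return words


print(phonemes_to_words(["b", "ae", "t", "ah", "k", "ae", "t"]))
```

Any word missing from `LEXICON` comes out as `<unk>`, which is the maintenance cost listed under the disadvantages: new vocabulary means new lexicon entries.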

Toolkits that support phoneme-based ASR include:

NeMo
SpeechBrain

Graphemes

A grapheme is the smallest unit of written language.

Examples:

cat → c + a + t
brioche → b r i o c h e

Grapheme-based ASR predicts characters directly from audio.
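One common way for a model to emit characters from audio frames is CTC-style decoding: the network outputs one symbol (or a blank) per frame, consecutive repeats are merged, and blanks are dropped. The sketch below assumes hand-written per-frame outputs and a hypothetical blank symbol `_`. It illustrates the collapse step only and is not a description of any specific system.

```python
BLANK = "_"  # hypothetical blank symbol meaning "no character at this frame"


def ctc_greedy_collapse(frame_symbols):
    """Collapse per-frame symbols: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in frame_symbols:
        # Keep a symbol only when it differs from the previous frame and is not blank.
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


# Eight frames of made-up model output for the word "cat".
print(ctc_greedy_collapse(list("cc_aa__t")))
```

A blank between two identical characters keeps them separate, so the frame sequence `hel_lo` collapses to `hello` rather than `helo`.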

Examples of grapheme-based ASR models:

DeepSpeech
Wav2Vec 2.0

Why Use Graphemes in ASR?

Advantages

Simple training pipeline
No pronunciation dictionary required
Easier multilingual support

Disadvantages

Struggles with irregular spelling
Sensitive to accents
Harder to disambiguate homophones

Grapheme-based ASR is popular in end-to-end neural models.

Subword Units

Subword units sit between characters and full words.

Examples:

brioche → bri + oche
unbelievable → un + believ + able

Common techniques:

Byte Pair Encoding (BPE)
WordPiece
SentencePiece (a toolkit that implements BPE and unigram language-model tokenization)
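To make the idea of learned subword units concrete, here is a toy sketch of Byte Pair Encoding: start from single characters and repeatedly merge the most frequent adjacent pair. The function names (`learn_bpe`, `merge_pair`, `apply_bpe`) and the tiny corpus are assumptions for illustration. Real tokenizers add word-boundary markers, handle unknown characters, and train on far larger corpora.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn an ordered list of BPE merges from a list of words (toy version)."""
    # Each distinct word starts as a tuple of characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(apply_bpe("slow", merges))
```

Because unseen words fall back to smaller pieces and ultimately to single characters, a subword vocabulary can spell out rare words without needing an entry for each one.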

Why Modern ASR Uses Subwords

Advantages

Handles rare words better than characters
Smaller vocabulary than word-level models
Language-agnostic
Strong balance of accuracy and simplicity

Disadvantages

Tokenization complexity
Less interpretable than phonemes

Subwords dominate modern transformer-based ASR systems.

Side-by-Side Comparison

Unit Type	Represents	Strengths	Weaknesses
Phonemes	Sounds	Accuracy, accent handling	Complex, language-specific
Graphemes	Characters	Simplicity, multilingual	Poor with irregular spelling
Subwords	Word fragments	Balanced, scalable	Tokenization overhead

Impact on Real-Time ASR

Latency

Phonemes: extra decoding step → slightly higher latency
Graphemes: fastest decoding
Subwords: balanced latency

Accuracy

Phonemes excel in noisy, domain-specific speech
Subwords excel in general-purpose ASR
Graphemes struggle with ambiguous pronunciations

Common ASR Confusion Cases

Why “brioche” becomes “burrito”

Similar phoneme sequences
Language model bias toward frequent words
Missing domain vocabulary
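The first cause can be made measurable with an edit distance over phoneme sequences. The sketch below is a standard Levenshtein distance applied to lists of phoneme symbols. The transcriptions used are simplified assumptions, not output from a real pronunciation dictionary.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute = 1 each)."""
    # prev[j] holds the distance between the first i-1 items of a and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


# Simplified, assumed phoneme transcriptions.
brioche = ["b", "r", "i", "oʊ", "ʃ"]
burrito = ["b", "ə", "r", "i", "t", "oʊ"]

print(edit_distance(brioche, burrito))
```

The two words share the ordered phonemes b, r, i, and oʊ, so only a few edits separate them. When acoustic evidence is that close, a language model that has seen "burrito" far more often can tip the decision.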

Mitigations

Phrase boosting
Context-aware language models
Phoneme-level biasing
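As a rough illustration of phrase boosting, the sketch below re-ranks a list of candidate transcripts by adding a fixed bonus for each boosted phrase a candidate contains. The function name `boost_hypotheses`, the scores, and the bonus value are all hypothetical. Real systems usually apply biasing inside the decoder rather than as a post-processing step.

```python
def boost_hypotheses(hypotheses, boosted_phrases, bonus=2.0):
    """Re-rank (text, score) pairs, adding `bonus` for each boosted phrase found."""
    rescored = []
    for text, score in hypotheses:
        # Pad with spaces so phrases only match on whole-word boundaries.
        padded = f" {text.lower()} "
        hits = sum(f" {phrase.lower()} " in padded for phrase in boosted_phrases)
        rescored.append((text, score + bonus * hits))
    # Higher score is better; the sort is stable, so ties keep their input order.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Made-up log-probability-like scores for two competing hypotheses.
hypotheses = [
    ("i would like a burrito", -4.0),
    ("i would like a brioche", -5.0),
]

print(boost_hypotheses(hypotheses, boosted_phrases=["brioche"]))
```

With no boosted phrases the more frequent word wins. Once "brioche" is boosted, its hypothesis moves to the top. The size of the bonus is a tuning choice: too large and the boosted phrase appears where it was never said.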

Choosing the Right Unit

Use phonemes if:

You control the vocabulary
Accuracy matters more than simplicity
You can maintain lexicons

Use graphemes if:

You want fast iteration
You support many languages
Simplicity is the priority

Use subwords if:

You want production-ready balance
You use transformer-based models
You care about scalability and maintainability

Production Trade-Offs

Concern	Phonemes	Graphemes	Subwords
Training complexity	High	Low	Medium
Multilingual support	Low	High	High
Domain adaptation	Excellent	Limited	Strong
Maintenance cost	High	Low	Medium

Key Takeaways

Output units are a system design decision, not an implementation detail
Phonemes favor accuracy and control
Graphemes favor simplicity and speed
Subwords often offer the best production trade-off
Real-time ASR magnifies all trade-offs

Great ASR systems are engineered, not just trained.