Experimenting with Speech Augmentations: Enhancing Speech-to-text Model Robustness

Explore speech augmentation experiments to boost ASR model robustness. Learn key techniques, practical examples, and their impact on WER/CER

TL;DR:

  1. Noise Augmentation is Essential

    Adding environmental noise that simulates real-world conditions significantly improved model robustness.

  2. Don’t over-augment… aggressive augmentation hurts accuracy

    As you’ll see later in the blog post, using augmentation excessively does more harm than good.

  3. Drop-type augmentations worked best for my data, which was already noisy

    Your mileage may vary here, read point #7 below.

  4. Specific use-cases will need different augmentations

    Focused on phone-call scenarios? Channel/codec simulation, noise simulation, and reverberation will probably serve you best here

  5. Combined Strategy Works Best

    A carefully balanced mix of augmentations outperformed any single technique.

  6. Try to use an augmentation library that supports GPU operations

    Augmentation adds extra compute to the training process. SpeechBrain is built on top of PyTorch, so it should use the GPU by default, but that wasn’t the case during my experiments; I’m pretty sure I set something up incorrectly

  7. You HAVE to experiment

    As I’m GPU-poor, my trial and error was limited to 2 epochs with different augmentations enabled/disabled. Ideally, I would have increased the number of epochs to at least 5

Introduction: Why Augment Speech Data?

In the realm of Automatic Speech Recognition (ASR), the quality and diversity of training data directly impact model performance. However, collecting large, varied datasets of real-world speech can be expensive and time-consuming. This is where speech augmentation comes in - artificially creating variations of existing audio to expose models to different acoustic conditions they might encounter in the real world.

Think of it like teaching a child to understand speech. They need to learn to comprehend words not just in perfect conditions, but also:

  • When there’s background noise (like in a café)
  • When people speak faster or slower
  • In rooms with different acoustics (bathroom vs. living room)
  • Through different audio devices (phone vs. speaker)

If you’ve read the previous blog post, you’ll know that we had a highly imbalanced speech dataset: it was heavily biased towards male speakers (85%), and speaker diversity was limited, with 72 hours of audio covering at most 100 speakers

Remember that to get the best results from fine-tuning an ASR/STT model, we need a dataset that’s:

  • High-Quality, Consistent Audio: minimal noise, clear audio and high bitrate
  • Diverse Speaker Representation: good coverage across genders, ages, accents, and dialects so that the model can generalize
  • Precise, Time-Aligned Transcripts: transcription text has to be aligned with the speech in the audio files
  • Rich Metadata: Detailed speaker/device/environment tags
  • Wide Content Coverage: while not absolutely essential in all cases, conversational, read, and domain-specific speech (e.g., technical jargon, dates, proper names) helps build robustness across contexts

Our current curated dataset would easily score a ‘D’ if evaluated against these characteristics… I didn’t want to throw away two weeks’ worth of work, so I started looking for the best way to make use of it… before moving on to new datasets (and models) with the lessons learnt

The main shortcoming of my dataset was diversity. Augmentation creates artificial variations of existing data, forcing the model to learn features that are invariant to pitch, speed, noise, and acoustic conditions. In effect, it simulates a more diverse dataset.

That being said, let’s look at the results… You’ll find Weights & Biases links for (most of) these experiments, and the training code is on GitHub

Hyperparameters and settings

  • batch_size_dynamic: false
  • dynamic_batch_numb_buckets: 60
  • eos_index: 50,257
  • epochs: 2
  • grad_accumulation_factor: 2
  • learning_rate: 0.00001
  • loader_batch_size: 8
  • lr_annealing_factor: 0.9
  • lr_improvement_threshold: 0.0025
  • lr_patient: 0
  • lr_warmup_steps: 1,000
  • max_batch_len_seconds: 40
  • max_grad_norm: 5
  • num_beams: 5
  • num_checkpoints_to_keep: 2
  • num_workers: 8
  • optimizer_type: “AdamW”
  • pad_token_id: 50,257
  • scheduler_type: “NewBob” (annealing rule sketched below)
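
The NewBob-style scheduler referenced above shrinks the learning rate whenever the relative improvement between epochs falls below a threshold. Here’s a minimal sketch of that rule using the values from the list (my own simplification, not SpeechBrain’s exact scheduler code):

```python
def newbob_anneal(lr, prev_metric, curr_metric,
                  annealing_factor=0.9, improvement_threshold=0.0025):
    """NewBob-style rule: shrink the LR when the relative improvement is too small."""
    improvement = (prev_metric - curr_metric) / prev_metric
    if improvement < improvement_threshold:
        lr *= annealing_factor            # e.g. 1e-5 -> 9e-6
    return lr

# Validation loss barely moved (0.1% relative), so the LR gets annealed:
print(newbob_anneal(1e-5, prev_metric=0.500, curr_metric=0.4995))  # ~9e-06
```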

All training was done on 1x A100-40GB on Modal Labs with 8x vCPU… the cost was around $3.85 (Modal is priced higher than other providers, but honestly it’s worth it)

Augmentation Mechanism

I’ve tried to use most of the augmentations made available by the SpeechBrain Library:

Noise Addition (Custom Wrapper - NoiseSampler):

Noise addition involves mixing various background sounds (e.g., street noise, café chatter, office hum) with the original audio. This helps the ASR model become more robust to real-world environments where clean audio is rare. By training on noisy speech, the model learns to distinguish speech from a multitude of interfering sounds, improving its accuracy in practical applications.
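
To illustrate the idea (this is my own sketch, not the NoiseSampler wrapper or SpeechBrain’s AddNoise), mixing a noise clip into speech at a target signal-to-noise ratio looks roughly like this:

```python
import torch

def add_noise(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix `noise` into `speech` (1-D waveforms) at the requested SNR in dB."""
    noise = noise[: speech.numel()]                      # assume the noise clip is long enough
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    # Scale the noise so that speech_power / scaled_noise_power matches the target SNR.
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

noisy = add_noise(torch.randn(16000), torch.randn(16000), snr_db=10.0)  # 1 s at 16 kHz
```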

Reverberation (RIR Addition):

Reverberation simulates the acoustic reflections of a room. A Room Impulse Response (RIR) is a recording of how a sound pulse (like a clap) echoes in a specific space. Convolving clean audio with RIRs makes the speech sound as if it were recorded in different environments (e.g., a lecture hall, a small office, a car)
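
Conceptually, applying an RIR is just a convolution of the dry signal with the impulse response. A rough illustration (using SciPy here rather than SpeechBrain’s AddReverb):

```python
import numpy as np
from scipy.signal import fftconvolve

def add_reverb(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve dry speech with a room impulse response and keep the original length."""
    wet = fftconvolve(speech, rir, mode="full")[: len(speech)]
    # Rescale so the reverberant signal keeps roughly the original peak level.
    return wet * (np.abs(speech).max() / (np.abs(wet).max() + 1e-10))

speech = np.random.randn(16000).astype(np.float32)        # 1 s of fake "speech"
rir = np.exp(-np.linspace(0, 8, 4000)).astype(np.float32)  # toy decaying impulse response
reverbed = add_reverb(speech, rir)
```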

Speed Perturbation:

Speed perturbation involves slightly speeding up or slowing down the audio playback speed without changing the pitch. This simulates natural variations in human speaking rates
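
One way to get a pitch-preserving speed change is sox’s `tempo` effect via torchaudio; this is an assumption about tooling on my part (SpeechBrain’s SpeedPerturb uses its own resampling-based approach), and it requires the sox backend to be available:

```python
import torch
import torchaudio

def speed_perturb(waveform: torch.Tensor, sample_rate: int, factor: float) -> torch.Tensor:
    """Speed the audio up/down by `factor` (e.g. 0.9 or 1.1) while preserving pitch."""
    out, _ = torchaudio.sox_effects.apply_effects_tensor(
        waveform, sample_rate, effects=[["tempo", str(factor)]]
    )
    return out

faster = speed_perturb(torch.randn(1, 16000), 16000, factor=1.1)  # shape: [channels, time]
```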

DropChunk:

Chunk Dropout, also known as Time Erasing, involves randomly selecting short segments (or “chunks”) of the audio and replacing them with silence or noise. This simulates real-world scenarios like packet loss during audio streaming or brief interruptions.
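
A bare-bones version of the idea (not SpeechBrain’s DropChunk, which has more knobs and supports noise fill) just zeroes out random segments:

```python
import torch

def drop_chunks(waveform: torch.Tensor, num_chunks: int = 3, chunk_len: int = 1600) -> torch.Tensor:
    """Silence `num_chunks` random segments of a 1-D waveform (time erasing)."""
    out = waveform.clone()
    for _ in range(num_chunks):
        start = torch.randint(0, max(1, out.numel() - chunk_len), (1,)).item()
        out[start : start + chunk_len] = 0.0          # replace the chunk with silence
    return out

dropped = drop_chunks(torch.randn(16000))             # three random 0.1 s gaps at 16 kHz
```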

DropFreq:

Frequency Masking, or Spectral Hole Augmentation (sometimes referred to as DropFreq), operates on the spectrogram of the audio. It involves randomly selecting and masking out (attenuating or silencing) certain frequency bands. This simulates situations where specific frequencies might be lost
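
One simple way to implement the idea is to zero out random bands of the STFT and resynthesize; this is illustrative only and not how SpeechBrain’s DropFreq is implemented internally:

```python
import torch

def drop_freq_bands(waveform: torch.Tensor, n_fft: int = 512,
                    band_width: int = 10, num_bands: int = 2) -> torch.Tensor:
    """Mask random frequency bands in the spectrogram, then go back to a waveform."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft=n_fft, window=window, return_complex=True)
    for _ in range(num_bands):
        start = torch.randint(0, spec.shape[0] - band_width, (1,)).item()
        spec[start : start + band_width, :] = 0        # drop this frequency band
    return torch.istft(spec, n_fft=n_fft, window=window, length=waveform.numel())

masked = drop_freq_bands(torch.randn(16000))
```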

Pitch Shift (Manual/Custom Wrapper):

Pitch shifting modifies the frequency (pitch) of the speech signal without changing its speed. This can simulate variations in speakers’ voices (e.g., higher or lower-pitched voices)
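
For reference, recent torchaudio versions ship a PitchShift transform that does exactly this; a tiny example (an assumption about tooling on my part, the custom wrapper in this post may do it differently):

```python
import torch
import torchaudio

# Shift pitch up by 2 semitones without changing duration
# (assumes a torchaudio version that provides transforms.PitchShift).
shifter = torchaudio.transforms.PitchShift(sample_rate=16000, n_steps=2)
shifted = shifter(torch.randn(1, 16000))   # input shape: [channels, time]
```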

DoClip:

Clipping an audio signal means cutting off the parts of the sound wave that exceed a certain loudness level. This simulates what happens when a microphone is overwhelmed by a loud sound or when an audio signal is amplified too much
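
The effect itself is just a hard clamp on the waveform; a minimal sketch (not SpeechBrain’s DoClip, which randomizes the threshold):

```python
import torch

def do_clip(waveform: torch.Tensor, clip_level: float = 0.5) -> torch.Tensor:
    """Hard-clip the waveform at +/- `clip_level` times its peak amplitude."""
    limit = clip_level * waveform.abs().max().item()
    return waveform.clamp(min=-limit, max=limit)

clipped = do_clip(torch.randn(16000), clip_level=0.5)
```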

DropBitResolution:

Bit Depth Reduction (DropBitResolution) involves intentionally lowering the audio quality by reducing the number of bits used to represent each audio sample. This makes the audio sound less precise, which helps the model handle audio that might have been recorded with lower-quality equipment
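
In essence this is a re-quantization step; a quick sketch of the idea (assuming waveforms normalized to [-1, 1], not SpeechBrain’s implementation):

```python
import torch

def drop_bit_resolution(waveform: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Re-quantize the waveform to `bits` of resolution to mimic low-quality hardware."""
    levels = 2 ** (bits - 1)                  # e.g. 128 levels per polarity for 8 bits
    return torch.round(waveform * levels) / levels

lofi = drop_bit_resolution(torch.rand(16000) * 2 - 1, bits=8)
```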

Gain Wrapper:

Volume Scaling involves randomly increasing or decreasing the overall volume of the audio. This simulates variations in recording levels, microphone sensitivity, or a speaker’s distance from the microphone
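
A minimal version of the wrapper’s idea is to draw a random gain in dB and apply it to the waveform (my own sketch, not the actual wrapper):

```python
import torch

def random_gain(waveform: torch.Tensor, min_db: float = -12.0, max_db: float = 12.0) -> torch.Tensor:
    """Scale the waveform by a random gain (in dB) to simulate different recording levels."""
    gain_db = torch.empty(1).uniform_(min_db, max_db).item()
    return waveform * (10 ** (gain_db / 20))

scaled = random_gain(torch.randn(16000))
```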

All of these augmentations were placed in a pool from which they are randomly selected per sample, according to the augmentation probability variable
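
The selection logic itself is straightforward; SpeechBrain ships an Augmenter class for this, and the sketch below is just my own simplification of how such a pool can be sampled per utterance (the stand-in lambdas are placeholders, not real augmentations):

```python
import random
import torch

def augment_from_pool(waveform: torch.Tensor, pool, augment_prob: float = 0.5,
                      min_augs: int = 1, max_augs: int = 3) -> torch.Tensor:
    """With probability `augment_prob`, apply a random subset of augmentations from `pool`."""
    if random.random() > augment_prob:            # e.g. 50% of samples pass through untouched
        return waveform
    n = random.randint(min_augs, min(max_augs, len(pool)))
    for aug in random.sample(pool, n):            # pick n distinct augmentations
        waveform = aug(waveform)
    return waveform

# Stand-in pool with toy lambdas (use the real augmentation callables in practice).
pool = [
    lambda w: w * 0.5,                            # gain
    lambda w: w.clamp(-0.5, 0.5),                 # clipping
    lambda w: torch.round(w * 128) / 128,         # bit-depth reduction
]
augmented = augment_from_pool(torch.randn(16000), pool)
```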

All of that being said, let’s get into the results

Baseline: all augmentations disabled

Results:

  • Test WER: 43.55
  • Test CER: 30.73

Note: this seems pretty “standard”. This is our baseline; let’s see how augmentations will improve (or hurt) these numbers

Experiment 1: all augmentations enabled

  • Max augmentations = 3
  • Augmentation probability = 50% (half of the samples are fed into the augmentation pipeline)

Results:

  • Test WER: 30.81
  • Test CER: 23.01

Note: aggressive augmentation (max = 3) doesn’t seem to be the best idea, particularly with this dataset, which already contains mostly noisy samples, but hey, we’re definitely seeing an improvement over our baseline

WandB Run : https://wandb.ai/m-adelomar1/whisper-small-egyptian-arabic/runs/gjlx4ecx/workspace

Experiment 2: Focus on Additive Noise & Reverb to Simulate acoustic mismatch

  • Augmentations used: Noise and Reverb only
    • AddNoise
    • AddReverb
  • Augment Prob = 50%
  • Max augmentations = 2

Results:

  • Test WER: 31.78
  • Test CER: 23.8

Note: less accurate than Experiment 1. Again, the dataset already contains a lot of reverberant and noisy samples, so increasing those alone doesn’t yield the best result

WandB Run : https://wandb.ai/m-adelomar1/whisper-small-egyptian-arabic/runs/u842qt0v/workspace

Experiment 3: Focus on Dropout Augmentations

  • Augmentations used: Drop chunk and Drop frequency (2)
    • drop_chunk, drop_freq
  • 2 Epochs
  • Augment Prob = 50%

Results:

  • Test WER: 30.10
  • Test CER: 22.47

Note: while CER is improving, WER seems to have hit a wall

WandB Run : https://wandb.ai/m-adelomar1/whisper-small-egyptian-arabic/runs/p81mrn6f/overview

Experiment 4: Focus on Simulating a phone call

  • Augmentations used: CodecAugment, DropBit, AddNoise
  • Augment Prob = 100%
  • Augmentations = 3, non-random
  • 2 Epochs

Results:

  • Test WER: 61.34
  • Test CER: 41.79

Note: as you can tell, this is a massive performance downgrade; I probably need to re-check the augmentation function implementation. Also, aggressive augmentation (augment_prob = 100%) hurts accuracy

WandB Run : https://wandb.ai/m-adelomar1/whisper-small-egyptian-arabic/runs/nno9syu7/overview

Experiment 5: Focus on Dropout Augmentations while adding drop_bit

  • Augmentations used: Drop chunk, Drop frequency, and Drop bit
    • drop_chunk, drop_freq, drop_bit
  • 2 Epochs
  • Augment Prob = 50%
  • Max augmentations = 3 , min = 1

Results:

  • Test WER: 29.91
  • Test CER: 22.29

Note: Nice, new best result, although by a slim margin

WandB Run : https://wandb.ai/m-adelomar1/whisper-small-egyptian-arabic/runs/caany6l9

Conclusion & Final Result

So, it seems the dropout augmentations are our best candidates here. Setting the min and max augmentations to 1 and 3 respectively yielded the best result

So, I took this experiment’s settings, loaded Whisper-small, and started a training run for 10 epochs. The results were interesting:

Results:

  • Test WER: 22.78896
  • Test CER: 16.76496

Looking at the baseline figures where we started:

  • Test WER: 43.55
  • Test CER: 30.73

With the dropout augmentation strategy (Experiment 5 settings) trained for 10 epochs, we achieved a:

  • Test WER of 22.78896
  • and a Test CER of 16.76496

This represents approximately a 48% improvement in WER and a 45% improvement in CER compared to our baseline!

Valid WER for the different experiments (note that Ablation 6 is actually Experiment 5)

And that, my friends, is why you should seriously consider implementing speech augmentation in your ASR/STT training pipelines.

Testing the model in real life with pre-recorded audio… it didn’t yield the best results compared with Azure’s STT and OpenAI’s GPT-4o-transcribe

But then again, Whisper-Small is a model with 244M parameters, so I didn’t really have high expectations for it to beat SOTA models served by trillion dollar companies

Next, I’ll be using a larger variant of Whisper, as well as experimenting with an E2E speech model. Let’s see how that goes!