Curating Custom Datasets for Arabic Speech-to-Text Models:

A Case Study on what not to do: Lessons learned curating diverse Egyptian Arabic speech datasets for training high-quality ASR models

TL;DR — Lessons Learned

  • Quantity ≠ Quality: Aggregating large volumes without upfront vetting led to wasted effort. Prioritize clean, well-understood data over raw hour count.

  • Alignment is Non-Negotiable: Misaligned audio-text pairs (e.g., long audio + short transcripts) severely degrade model training. Catch and filter early.

  • Pre-Aggregation Auditing is Essential: Evaluate each dataset’s structure, transcription standards, and alignment quality before integration.

  • Speaker Diversity Matters: A highly skewed gender balance (~85% male) and limited speaker variety introduce bias and reduce generalization. Track demographics from the start.

  • Good Datasets are Built, Not Collected: Effective dataset creation requires deliberate curation—source validation, targeted filtering, and iterative refinement—not indiscriminate scraping.

Introduction: Motivation

Automatic Speech Recognition (ASR) models for English have achieved remarkable accuracy, effectively becoming a mature technology.

New developments in English ASR often focus on optimizing latency or efficiency for deployment on resource-constrained devices. For many applications, English ASR can be considered a largely solved problem.

But step outside English, and things quickly look less rosy. Arabic, my own linguistic home turf, is tied with English as the 3rd/4th most spoken first language worldwide, according to the CIA World Factbook (circa 2018).

Bar chart showing most spoken first languages globally around 2018, highlighting Arabic’s position

Yet, finding quality, open-source ASR for Arabic—and specifically for dialects like my native Egyptian Arabic—still feels less like data science and more like hunting for a very particular grain of sand in a massive desert. This curated data is essential for tasks like fine-tuning Whisper on Arabic dialects or building robust speech-to-text models for Arabic from scratch.

Now, conventional wisdom recommends using massive, meticulously organized datasets like Mozilla’s Common Voice. It’s a logical and practical approach.

So, obviously, I chose not to do that :)

This project explored an alternative strategy. Several smaller, potentially overlooked Egyptian Arabic datasets exist. The hypothesis was: Could combining these varied, potentially less pristine sources into one larger dataset yield better results for model training?

This blog post documents that process—detailing the challenges encountered and lessons learned from this experiment in dataset aggregation and curation.

The Great Dataset Mashup & Initial Analysis

The first step involved identifying and acquiring available Egyptian Arabic speech datasets. Approximately six distinct datasets were located from various online sources, exhibiting varying levels of documentation quality.

Using a Jupyter notebook on Jarvis Labs and the digital equivalent of duct tape and prayers, I merged the six data sources found online into one big dataset. My df.info() moment revealed… 570 hours of audio! For a low-resource dialect, this felt like striking oil. Visions of state-of-the-art accuracy danced in my head.
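For anyone following along, the merge itself was nothing fancier than concatenating per-source manifests. Below is a minimal sketch of that step, assuming each source ships a JSONL manifest with a duration column in seconds; the file names are purely illustrative:

```python
import pandas as pd

# Hypothetical manifest locations; the real sources had far messier layouts.
manifests = {
    "dataset_a": "dataset_a/manifest.jsonl",
    "dataset_b": "dataset_b/manifest.jsonl",
    # ... remaining sources
}

frames = []
for source, path in manifests.items():
    frame = pd.read_json(path, lines=True)
    frame["dataset_source"] = source  # remember where each clip came from
    frames.append(frame)

df = pd.concat(frames, ignore_index=True)
print(f"Total audio: {df['duration'].sum() / 3600:.0f} hours across {len(df)} clips")
```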

However, the process of combining datasets collected under varying conditions and standards is often less straightforward than it appears. While technically unified, the resulting aggregation frequently inherits the quality issues of its individual sources.

A closer look at the sources revealed the scale of the challenge:

  • Dataset A (~465 hours): Constituting the majority of the data, this source exhibited problematic audio segmentation. Clips frequently overlapped or contained abrupt cuts mid-word. Furthermore, initial assessment suggested a potential lack of speaker diversity, possibly featuring only two primary speakers. This raised concerns about potential model bias.
  • Dataset B (~65 hours): Showed similar characteristics to Dataset A, albeit smaller in scale. Overlapping audio, questionable segmentation, and likely limited speaker diversity were observed.
  • Dataset C (~5 hours): Presented a unique structure. It contained only ~15 files, but each represented a long conversational recording (15-20 minutes). Analysis suggested potential duplication, possibly with separate channels for each speaker in the conversation.
  • Dataset D (~2.5 hours): Sourced from YouTube clips, this dataset appeared relatively well-structured with reasonable segmentation quality, serving as a benchmark for comparison. (The remaining data came from smaller, less distinct sources).

Therefore, the initial 570-hour figure required significant qualification. The breakdown by source highlighted critical potential issues: severe speaker imbalance, inconsistent segmentation, and unusual recording formats, indicating a substantial data cleaning and filtering effort would be necessary.

This context underscored the importance of the subsequent analysis steps.

Step 1: The First Glance (Nvidia NeMo SDE)

  • Why? To quickly grasp the general shape of the aggregated dataset, especially given the known quirks each source brought along.
  • Findings: Initial exploration confirmed expected issues: diverse character sets, vocabulary variations, and a duration distribution skewed by the extremely long files from Dataset C and segmentation overlaps from Datasets A and B.

Step 2: The Metadata Mirage (using Pandas)

  • Why? To check if metadata like speaker IDs, gender, or recording devices could help unravel dataset confusion.
  • Findings: Analysis revealed that metadata fields (speaker IDs, gender, device information) were largely incomplete or contained non-informative placeholder strings, limiting their utility for disambiguation.
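As an illustration, a quick completeness check along these lines is enough to expose the problem. This is a minimal sketch; the metadata column names and placeholder strings are assumptions, not the exact fields found in the sources:

```python
# df: the aggregated manifest from earlier.
meta_cols = ["speaker_id", "gender", "device"]          # hypothetical column names
placeholders = {"", "unknown", "n/a", "null", "none"}   # assumed placeholder strings

for col in meta_cols:
    if col not in df.columns:
        print(f"{col}: column missing entirely")
        continue
    values = df[col].astype(str).str.strip().str.lower()
    usable = df[col].notna() & ~values.isin(placeholders)
    print(f"{col}: {usable.mean():.1%} usable values")
```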

Step 3: Duration Dilemmas (Pandas)

  • Why? To measure precisely how problematic file durations were, especially now suspecting Source C and the overlap issues from A/B.
  • Findings: Quantitative analysis confirmed the presence of numerous very short audio snippets (<0.5s), likely noise or errors, alongside extremely long files (>30s), primarily from Dataset C. While the majority of files fell within a typical duration range, these outliers represented significant data quality problems.
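Quantifying this takes only a couple of pandas lines; the sketch below reuses the aggregated df and assumes duration is stored in seconds:

```python
# Flag implausibly short snippets and suspiciously long recordings.
too_short = df[df["duration"] < 0.5]
too_long = df[df["duration"] > 30]

print(f"Short snippets (<0.5s): {len(too_short)} files, "
      f"{too_short['duration'].sum() / 3600:.1f} h")
print(f"Long files (>30s): {len(too_long)} files, "
      f"{too_long['duration'].sum() / 3600:.1f} h")

# Which sources contribute the long outliers?
print(too_long["dataset_source"].value_counts())
```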

Step 4: Transcription Consistency Audit (Pandas & Python)

  • Why? If audio formats varied this drastically, the transcripts were probably a mess as well.
  • Findings: Analysis confirmed significant inconsistencies in transcription standards across sources. A total of 168 unique characters appeared in the transcripts, reflecting varied use of diacritics (Tashkeel), digits, and punctuation, plus inconsistent handling of specific letters (e.g., ‘g’/‘ج’). Some very short transcripts turned out to be linguistically valid upon closer inspection.

List of unique non-standard characters found in raw Arabic dataset transcripts before cleaning
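Pulling out the character inventory takes a few lines of Python. In this rough sketch, the "core Arabic letters" range is my own simplification of what counts as expected:

```python
from collections import Counter

# Count every character across all transcripts.
char_counts = Counter()
for text in df["text"].astype(str):
    char_counts.update(text)

print(f"Unique characters: {len(char_counts)}")

# Anything outside the basic Arabic letter block (and plain spaces) goes up for review.
suspicious = {ch: n for ch, n in char_counts.items()
              if not ("\u0621" <= ch <= "\u064A" or ch == " ")}
print(sorted(suspicious.items(), key=lambda kv: -kv[1])[:20])
```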

Step 5: Speech Rate Reality Check (Pandas)

  • Why? To see if speech rates correlated with the known duration issues, potentially spotlighting further source-specific problems.
  • Findings: A strong correlation was observed between abnormal speech rates (both very high and very low) and problematic file durations. This further implicated Dataset C’s long files and the segmentation issues in Datasets A and B as major sources requiring targeted filtering.
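A sketch of that check, with rate thresholds that are illustrative assumptions rather than the project's exact cut-offs:

```python
# Characters-per-second and words-per-second as rough alignment proxies.
df["char_rate"] = df["text"].str.len() / df["duration"]
df["word_rate"] = df["text"].str.split().str.len() / df["duration"]

# Do abnormal rates cluster in the same files (and sources) as the duration outliers?
abnormal = (df["char_rate"] < 1) | (df["char_rate"] > 25)  # assumed thresholds
print(df.loc[abnormal, "duration"].describe())
print(df.loc[abnormal, "dataset_source"].value_counts())
```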

Exploration Summary:

The initial 570 hours represented a complex aggregation of varying quality. The analysis revealed:

  • Dominance by two sources (A & B) with questionable segmentation and limited speaker diversity.
  • A bizarre outlier source (C) with extremely long, possibly duplicated conversational files.
  • At least one relatively clean source (D), showing it was possible.
  • Sparse and unreliable metadata across most of the aggregated data.
  • Wildly inconsistent and noisy transcriptions, likely reflecting the standards (or lack thereof) of the original datasets.

It quickly became clear that this wasn’t just a big dataset—it was a patchwork of recording styles and transcription habits, each with its own logic (or lack thereof). Making sense of it all meant going beyond surface-level fixes. A proper cleanup and normalization pipeline wasn’t just helpful—it was necessary. That’s what came next.

The Pre-processing Gauntlet

The next phase involved intensive data pre-processing and cleaning.

The Basics - Format Wrangling & Sanity Checks

  • Action: All audio files were converted to a standard WAV format with a 16kHz sample rate to ensure consistency. Associated metadata (transcripts, source identifiers) were maintained alongside the audio.
  • Next: Standard data hygiene steps were performed: exact duplicates (based on audio path or transcript content) were removed; entries missing essential information (audio path, transcript, duration) were dropped; durations were calculated for all remaining files; and file path validity was verified.
  • Rationale: Standardize inputs, eliminate redundancy, ensure data integrity. Boring but essential groundwork. Skipping this is like building a house on quicksand.
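One way to handle the resampling step, sketched here with librosa and soundfile; the actual pipeline may well have used different tooling:

```python
import librosa
import soundfile as sf
from pathlib import Path

def to_16k_wav(src_path: str, dst_dir: str) -> str:
    """Decode any supported audio file and write it out as 16 kHz mono WAV."""
    audio, _ = librosa.load(src_path, sr=16000, mono=True)  # decode + resample
    dst_path = Path(dst_dir) / (Path(src_path).stem + ".wav")
    sf.write(dst_path, audio, 16000)
    return str(dst_path)
```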

Metadata Diet - Less is More

  • Action: Columns with extremely high sparsity (over 99% missing values, after accounting for empty strings) were removed. The essential columns retained were: audio_filepath, duration, text, and dataset_source.
  • Rationale: Simplify the dataset, remove noise, focus on what matters for ASR. A lean dataset is a happy dataset.
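The pruning itself reduces to a few pandas lines; a minimal sketch:

```python
import pandas as pd

# Treat empty strings as missing, then drop columns that are over 99% empty.
df = df.replace("", pd.NA)
sparse_cols = [c for c in df.columns if df[c].isna().mean() > 0.99]
df = df.drop(columns=sparse_cols)

# Keep only what the ASR pipeline actually needs.
df = df[["audio_filepath", "duration", "text", "dataset_source"]]
```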

The Great Text Detox

Transcript cleaning was a major focus due to the high level of inconsistency.

  • Action: A multi-pronged attack (a condensed code sketch follows this list):
    1. Character Normalization: Non-essential characters identified during exploration were removed. This included Arabic diacritics (Tashkeel), Tatweel (ـ), most punctuation marks, English letters, and various symbols (e.g., \n, \xa0, ٪). The focus was on retaining core Arabic letters and spaces.
    2. Number Conversion: Both Western (0-9) and Eastern Arabic (٠-٩) numerals were converted to their full Arabic word representations (e.g., “3” or “٣” became “ثلاثة”) using the num2words library, so that training transcripts spell out numbers the way they are actually spoken.
    3. Letter Correction: The frequently misused ‘g’ character was globally replaced with the correct Arabic letter ‘ج’ (Jeem).
    4. Whitespace Normalization: Multiple consecutive spaces were collapsed into single spaces, and leading/trailing whitespace was removed from all transcripts.
    5. Missing Space Correction: An issue observed was the erroneous merging of valid words (e.g., عزيزيحلم instead of عزيزي حلم). This likely stemmed from OCR errors or inconsistent annotation practices in the source datasets. Such merged forms represent invalid words and would negatively impact model tokenization and learning.
      • Solution: The CAMeL Tools library, specifically its morphological analyzer for Egyptian Arabic, was employed. A custom function evaluated each potential “word.” If deemed invalid by the analyzer, the function attempted splits at all possible positions (e.g., عزيزي + حلم). If a split resulted in two parts, both recognized as valid morphological units by the analyzer, a space was inserted at the split point. A simple heuristic prioritized splits where the first part was longer. This approach successfully corrected many valid transcripts affected by this formatting issue.
  • Rationale: Create clean, consistent, linguistically sound text that a model can actually learn from. Removing noise, standardizing formats, and fixing structural errors like missing spaces are crucial for effective training.
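Below is a condensed sketch of how steps 1-4 might be chained together. The regex ranges, the ordering, and the blunt global ‘g’ replacement are illustrative choices rather than the project's exact pipeline; the missing-space correction (step 5) is sketched separately in the tools section at the end:

```python
import re
from num2words import num2words

TASHKEEL = re.compile(r"[\u064B-\u0652\u0640]")                 # diacritics + Tatweel
NON_ARABIC = re.compile(r"[^\u0621-\u064A0-9\u0660-\u0669 ]")   # keep Arabic letters, digits, spaces
DIGITS = re.compile(r"[0-9\u0660-\u0669]+")
EASTERN_TO_WESTERN = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

def clean_transcript(text: str) -> str:
    text = text.replace("g", "ج")             # letter correction
    text = TASHKEEL.sub("", text)             # strip Tashkeel and Tatweel
    text = NON_ARABIC.sub(" ", text)          # drop punctuation, Latin letters, symbols
    text = DIGITS.sub(                        # spell out remaining digits as Arabic words
        lambda m: num2words(int(m.group().translate(EASTERN_TO_WESTERN)), lang="ar"),
        text,
    )
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```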

Diagnostics Sidebar: The Rate Detective

Before aggressive filtering, Character Rate (characters/second) and Word Rate (words/second) were calculated as diagnostic metrics.

  • What we did: Calculated these rates after initial cleaning but before filtering. We looked for extreme outliers, especially super low rates.
  • Findings: The rate analysis quantitatively confirmed the suspected misalignment issues. Extremely long audio files (especially from Dataset C, but also observed in shorter misaligned clips from Dataset A) paired with very short transcripts resulted in character/word rates approaching zero (e.g., 0.01 chars/sec). This provided strong evidence of severe audio-text mismatch, where the transcript represented only a small fraction of the recorded audio duration. This justified subsequent filtering based on duration.
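A sketch of this flagging step, recomputing the rate on the cleaned transcripts (the 0.5 chars/sec cut-off is an illustrative assumption, not the project's exact value):

```python
# Near-zero character rates signal that the transcript covers only a fraction of the audio.
df["char_rate"] = df["text"].str.len() / df["duration"]
misaligned = df[df["char_rate"] < 0.5]  # illustrative threshold

print(f"Suspected misaligned pairs: {len(misaligned)} "
      f"({misaligned['duration'].sum() / 3600:.1f} h of audio)")
print(misaligned.groupby("dataset_source")["duration"].sum() / 3600)
```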

Armed with clean text and rate analysis insights, we filtered aggressively:

  • Action: Samples were removed based on the following criteria (see the code sketch after this list):
    1. Audio duration less than 0.5 seconds (likely noise or segmentation errors).
    2. Audio duration greater than 25 seconds (targeting the severely mismatched long files identified earlier).
    3. Transcript became empty after the cleaning process (indicating original content was entirely noise or non-linguistic symbols).
  • Rationale: Remove unusable segments, severely mismatched pairs, and samples left without text after normalization. Focus the model on learnable, reasonably well-aligned data. We explicitly didn’t filter purely on short text length, as many short transcripts were valid words.
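The filtering itself boils down to a few boolean masks over the cleaned manifest; a minimal sketch:

```python
# Apply the duration and empty-text filters.
before = len(df)
df = df[(df["duration"] >= 0.5) & (df["duration"] <= 25.0)]
df = df[df["text"].str.strip().str.len() > 0]

print(f"Kept {len(df)} of {before} samples "
      f"({df['duration'].sum() / 3600:.1f} h of audio remain)")
```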

Whoops!

Final Prep, Pack, and Ship

Almost there!

  • Action (sketched in code after this list):
    1. Copied only the filtered audio files to a clean directory.
    2. Updated the manifest file (.jsonl) to use relative paths pointing to this new directory (essential for portability).
    3. Kept only the necessary columns (audio_filepath, duration, text, dataset_source).
    4. Randomly shuffled the dataset (crucial as we lacked speaker info for stratified splits).
    5. Uploaded the cleaned audio files and the final manifest to a private Hugging Face Hub repository.
    6. Loaded the dataset from the Hub, performed a standard 80% train / 10% validation / 10% test split, and pushed this DatasetDict back to the Hub.
  • Rationale: Create a portable, clean, ready-to-use dataset package. Version control via the Hub. Standard splits for proper model training and evaluation.
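A rough sketch of the split-and-push step using the datasets library; the repository name is hypothetical and the seed is arbitrary:

```python
from datasets import load_dataset, DatasetDict

# Load the cleaned manifest (audio paths are relative to the packaged directory).
ds = load_dataset("json", data_files="manifest.jsonl", split="train")
ds = ds.shuffle(seed=42)

# 80/10/10 split via two successive train_test_split calls.
first = ds.train_test_split(test_size=0.2, seed=42)
second = first["test"].train_test_split(test_size=0.5, seed=42)

dataset = DatasetDict({
    "train": first["train"],
    "validation": second["train"],
    "test": second["test"],
})
dataset.push_to_hub("your-username/egyptian-arabic-asr", private=True)  # hypothetical repo id
```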

Reflections: Mistakes Made and Lessons Learned

Hindsight offers perfect clarity, and looking back at this project reveals several key missteps.

From an initial aggregation of 570 hours, rigorous cleaning and filtering (driven primarily by alignment issues) left a final dataset of approximately 72 hours. This dramatic reduction traces back to the following missteps:

  1. Quantity Bias & Insufficient Vetting: The primary mistake was prioritizing the sheer volume of data over rigorous upfront quality assessment. Driven by the allure of a large hour count (“more data = better model!”), diverse datasets were aggregated before thoroughly investigating the potential flaws and annotation standards of each source.

    • This “collect first, check later” approach meant significant effort was wasted processing data that was fundamentally unusable, necessitating a massive, time-consuming salvage operation later. Knowing the quirks of each source before integration is crucial.
  2. Ignoring Severe Misalignment: Stemming directly from insufficient vetting, the most critical technical oversight was incorporating a large dataset (~60% of the initial 570 hours!) where audio files were vastly longer than their corresponding transcripts. Attempting to train a model on pairs where a short text supposedly represents minutes of unrelated audio is fundamentally flawed and provides incorrect learning signals. This misalignment prevents effective training of accurate Arabic ASR models. This single issue was the primary driver behind the Great Dataset Shrinkage. Alignment isn’t just important; it’s non-negotiable.

  3. Neglecting Representation & Diversity: No attention was paid to demographic balance during dataset selection. Consequently, the final dataset suffered from significant gender skew (~85% male) and poor speaker diversity, potentially dominated by a few voices and lacking reliable speaker IDs for proper stratification. Training on such imbalanced data risks creating biased speech-to-text models for Arabic that fail to generalize well to new speakers. Addressing these issues post-hoc relies heavily on less ideal mitigation strategies like data augmentation and random shuffling.

Key Takeaways (The Hard-Won Wisdom)

This journey reinforces several core principles for dataset curation:

  • Vet Sources First, Aggregate Later: Quality trumps quantity. A smaller, clean, well-understood dataset is far more valuable than a massive, messy one. Test your pipeline on samples before committing to full-scale processing.

  • Prioritize Alignment: Ensure audio and transcripts reasonably correspond. Detect and discard severely misaligned data early and ruthlessly.

  • Mind the Demographics: Actively consider gender balance and speaker diversity during selection. Seek reliable labels to enable stratified splits and measure potential bias. Aim for many speakers with balanced contributions.

Ultimately, building effective custom datasets requires curation – careful selection, meticulous cleaning, and critical analysis – not just data hoarding. It’s about understanding the nuances of your data, anticipating downstream impacts, and being prepared to make A LOT of filtering decisions.

Tools That Became My Unlikely Sidekicks

Every messy project introduces you to new tools or forces you to appreciate old ones in new ways. Here were the standouts in this adventure:

  1. Nvidia NeMo Speech Data Explorer (SDE): The Quick Visualizer

    • Role: My first stop for getting a high-level, visual sanity check on the data.

    • Why it’s useful: It was great for quickly loading the initial manifest and seeing the big picture: overall duration, histogram shapes (spotting those duration outliers immediately!), getting a glimpse at the character set chaos, and even listening to a few samples. It helped identify where to dig deeper with code, saving initial analysis time. Think of it as the recon drone before sending in Pandas.

  2. CAMeL Tools: The Arabic Linguistic Genius

    • Role: The specialized cavalry called in to tackle the uniquely Arabic problem of words glued together without spaces (like عزيزيحلم).

    • Why it’s useful: This wasn’t just a spellchecker; it was like having a computational linguist on call.

      • We specifically used its Morphology Analyzer armed with an Egyptian Arabic database (calima-egy-r13). This database knows the rules of Egyptian Arabic word structure (prefixes, suffixes, valid stems).

      • Our custom function didn’t just guess splits. For a word like عزيزيحلم, it tried all possibilities (عزيزي + حلم, عزيز + يحلم, etc.). Then, it asked the CAMeL Analyzer: “Are both of these parts valid words/morphological units in Egyptian Arabic?”

      • If CAMeL said “Yes!” for a split, we knew we’d likely found the correct segmentation and inserted the space.

    • Impact: This linguistically-aware approach was far more accurate than simple dictionary lookups, especially for a language with rich morphology like Arabic. It rescued countless valid transcripts that were otherwise unusable due to formatting errors.
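A simplified sketch of that splitting logic with CAMeL Tools, assuming the calima-egy-r13 data package has already been downloaded; the project's actual function had more heuristics, but the core idea looks like this:

```python
from camel_tools.morphology.database import MorphologyDB
from camel_tools.morphology.analyzer import Analyzer

# Egyptian Arabic morphological database mentioned above.
db = MorphologyDB.builtin_db("calima-egy-r13")
analyzer = Analyzer(db)

def is_valid(word: str) -> bool:
    """A token counts as valid if the analyzer returns at least one analysis."""
    return len(analyzer.analyze(word)) > 0

def split_merged(word: str) -> str:
    """Try to re-insert a missing space into a merged token like عزيزيحلم."""
    if is_valid(word):
        return word
    # Prefer splits where the first part is longer, per the heuristic described above.
    for i in range(len(word) - 1, 0, -1):
        left, right = word[:i], word[i:]
        if is_valid(left) and is_valid(right):
            return f"{left} {right}"
    return word  # no valid split found; keep the original token
```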

  3. num2words Library: The Multilingual Number Translator

    • Role: Converting numerical digits (both Western 0-9 and Eastern Arabic ٠-٩) into their full Arabic word spellings.

    • Why it’s useful: Simple, effective, and handled the Arabic conversion flawlessly (lang='ar'). Made standardizing numbers into text (which ASR models generally prefer) a breeze within the cleaning pipeline.
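A tiny usage example; mapping Eastern Arabic digits to their Western counterparts first is my own addition, since num2words expects an integer:

```python
from num2words import num2words

EASTERN_TO_WESTERN = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

print(num2words(3, lang="ar"))                                        # ثلاثة
print(num2words(int("٣".translate(EASTERN_TO_WESTERN)), lang="ar"))   # same output for Eastern digits
```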

Dataset Wrangled… What Next?

After this extensive process of aggregation, analysis, cleaning, and filtering, the initial 570 hours were reduced to a final dataset of approximately 72 hours. While significantly smaller, this dataset is expected to be of higher quality and better suited for model training.

But was it worth it? Did this carefully assembled dataset actually improve accuracy metrics when fine-tuning Whisper on Arabic (specifically Egyptian), or training other ASR models for Arabic?

That’s what we’ll dig into next—results, training, and whether the custom dataset path led to better performance… or just a valuable lesson.