# Building Vozinha: Voice AI for Cabo Verde's Under‑represented Language

When most people hear "Cabo Verde," they picture sun‑drenched islands, morna music, and crystal‑clear waters. But for us, the archipelago represents one of the most fascinating - and difficult - NLP challenges in the Portuguese‑speaking world. Our team set out to build Vozinha ("little voice" in Cape Verdean Creole), a voice assistant that understands Kriolu, the native language of over 500,000 people. What we discovered forced us to set aside standard NLP textbooks and invent new workflows for data collection, model training, and deployment on infrastructure that no Silicon Valley startup would recognise. The result is a blueprint for building inclusive speech technology in any low‑resource language - and it all started with a single question: can a machine learn to understand Cabo Verde?

Kriolu isn't a dialect of Portuguese; it's a full‑fledged Creole language with its own grammar, phonology and vocabulary, influenced by West African languages, Portuguese, and even some English. Yet it remains almost completely absent from commercial speech products. Google Assistant, Siri, and Alexa support over 100 languages combined - none of them include Kriolu. This gap isn't an accident; it's a consequence of the data‑hungry nature of modern deep learning. Building a voice assistant for Cabo Verde required us to rethink core assumptions about transfer learning, data augmentation, and the entire lifecycle of an automatic speech recognition (ASR) system.

## The Linguistic Landscape of Cabo Verde: Kriolu and Its Dialects

Before writing a single line of code, we had to understand what "Kriolu" actually means. The language varies significantly between islands: the Sotavento (leeward) dialects spoken in Praia contrast sharply with the Barlavento (windward) varieties of São Vicente, Sal and Boa Vista. Phonetically, the difference between 'tx' (as in txeu - "much") and 'tch' can change the meaning of a sentence. A voice assistant trained only on the Santiago dialect would struggle to understand someone from São Nicolau.

We conducted a sociolinguistic survey with the help of researchers at the University of Cabo Verde in Mindelo. We recorded 240 speakers across 7 islands, capturing both spontaneous conversation and read speech. Each recording was annotated not only for transcription but also for dialect region, speaker age, and context (formal vs. casual). This metadata became crucial during training, as we observed that younger speakers in Praia code‑switch heavily between Kriolu and Portuguese, while older speakers in rural areas use a more conservative lexicon.

The total usable corpus ended at 510 hours of transcribed audio - minuscule by industry standards (LibriSpeech has about 1,000 hours for English alone). Data sparsity forced us to experiment with aggressive augmentation: speed perturbation, additive noise from Cape Verdean market recordings, and even synthetic speech generated by a Tacotron 2 model fine‑tuned on Kriolu. In production, we saw a 12% relative improvement in word error rate (WER) after applying three‑fold augmentation.

## Why Voice Assistants Fail for Low‑Resource Languages

Modern ASR systems rely on massive corpora of clean, read speech. A typical production model like Google's Universal Speech Model is trained on millions of hours across 300+ languages. When your target language has only around 500 hours of transcribed data - most of it with background wind noise, cars, or overlapping conversations - standard architectures collapse. We learned this the hard way after our initial attempt using Mozilla DeepSpeech with off‑the‑shelf hyperparameters produced a WER of 58% on clean test sets, far too high to be usable.

The core problem is the mismatch between training assumptions and real‑world conditions. Kriolu has sounds that don't exist in Portuguese or English: the implosive 'b' and 'd' (written as 'bb' and 'dd' in some orthographies), and the glottal stop represented by 'h' in 'nha' ("my"). An acoustic model pre‑trained on English phonemes completely misses these contrasts. Even transfer learning from Portuguese - the closest major language - fails because Portuguese lacks Kriolu's system of preverbal tense and aspect markers (e.g., n ta papiâ "I am speaking" vs. n papiâ "I spoke").

Our solution was a two‑stage approach. First, we trained a wav2vec 2.0 model from scratch on unlabelled Kriolu audio (about 1,200 hours of radio broadcasts). Second, we fine‑tuned the resulting representations on our labelled corpus, using a Connectionist Temporal Classification (CTC) decoder with a custom language model derived from news articles and folk tales. This brought WER down to 27% - still far from commercial quality, but functional for domain‑limited tasks like weather queries and music playback.
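To make the CTC step concrete, here is a minimal sketch of greedy CTC decoding: pick the most likely symbol per frame, merge consecutive repeats, and drop the blank token. It is a simplification and not the decoder described above, which also uses a language model; the vocabulary, the `_` blank symbol and the frame probabilities are made up for illustration.

```python
from typing import List, Sequence

BLANK = "_"  # hypothetical symbol standing in for the CTC blank token


def ctc_greedy_decode(frame_probs: Sequence[Sequence[float]],
                      vocab: Sequence[str],
                      blank: str = BLANK) -> str:
    """Pick the most likely symbol per frame, merge repeats, drop blanks."""
    best_path: List[str] = []
    for probs in frame_probs:
        best_index = max(range(len(probs)), key=probs.__getitem__)
        best_path.append(vocab[best_index])

    decoded: List[str] = []
    previous = None
    for symbol in best_path:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank token.
        if symbol != previous and symbol != blank:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


# Toy example: six frames over a four-symbol vocabulary (illustrative values).
vocab = ["_", "n", "h", "a"]
frames = [
    [0.1, 0.7, 0.1, 0.1],    # n
    [0.1, 0.6, 0.2, 0.1],    # n (repeat, merged)
    [0.1, 0.1, 0.7, 0.1],    # h
    [0.8, 0.1, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.1, 0.7],    # a
    [0.1, 0.1, 0.1, 0.7],    # a (repeat, merged)
]
print(ctc_greedy_decode(frames, vocab))  # nha
```

A blank between two identical symbols keeps them separate, which is how CTC represents genuine double letters.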

## Assembling a Corpus: Transcribing 500+ Hours of Cape Verdean Radio

Building the labelled dataset was the most labour‑intensive part of the project. We partnered with RCV (Rádio de Cabo Verde) to access archival broadcasts from 2015-2023: news, talk shows, and cultural programmes. Volunteer transcribers, all native Kriolu speakers, were trained to use a modified version of the Common Voice sentence‑collection interface. We paid special attention to orthographic consistency - Kriolu has multiple competing writing systems, including the official ALUPEC alphabet and the traditional "Linguistic Unity" proposal.

Quality control involved three‑stage verification: each utterance was transcribed by one person, reviewed by another, and spot‑checked by a third. We also flagged low‑confidence regions (e.g., overlapping speech, heavy wind) and excluded them from the main training set, storing them separately for noise‑robustness experiments. The final CSV contains 342,000 utterances spanning 510 hours, with an average utterance length of 4.8 seconds.
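The split between the main training set and the held-out noisy material can be sketched as a simple pass over the manifest. The column names (`utterance_id`, `duration_s`, `dialect`, `low_confidence`) and the three sample rows are assumptions for illustration; the actual CSV layout is not described here.

```python
import csv
import io

# Hypothetical manifest layout: these column names and rows are assumptions.
SAMPLE_MANIFEST = """utterance_id,duration_s,dialect,low_confidence
u001,4.2,santiago,0
u002,5.6,sao_vicente,1
u003,4.6,santiago,0
"""


def split_manifest(manifest_text):
    """Separate clean rows from rows flagged as low-confidence."""
    clean, noisy = [], []
    for row in csv.DictReader(io.StringIO(manifest_text)):
        row["duration_s"] = float(row["duration_s"])
        (noisy if row["low_confidence"] == "1" else clean).append(row)
    return clean, noisy


def total_hours(rows):
    """Total audio duration of the given rows, in hours."""
    return sum(row["duration_s"] for row in rows) / 3600.0


def mean_duration(rows):
    """Average utterance length in seconds (0.0 for an empty set)."""
    return sum(row["duration_s"] for row in rows) / len(rows) if rows else 0.0


clean_rows, noisy_rows = split_manifest(SAMPLE_MANIFEST)
print(len(clean_rows), "clean /", len(noisy_rows), "held out for noise experiments")
print("mean clean utterance length (s):", round(mean_duration(clean_rows), 2))
```

Keeping the flagged rows in a separate list rather than discarding them mirrors the decision to reuse them for noise-robustness experiments.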

One unexpected insight: in radio broadcasts, the announcers often switch between Kriolu and Portuguese mid‑sentence. This code‑switching pattern is entirely normal in Cabo Verde but wreaks havoc on language‑identification modules. We decided not to separate the languages - instead, we trained a single multilingual ASR model (Kriolu + Portuguese) that outputs both languages, relying on a separate Language ID post‑processor to tag segments. Evaluation showed that this unified model actually performed better on pure Kriolu clips (WER 27%) than a Kriolu‑only model (WER 31%), suggesting that the Portuguese signal provided beneficial cross‑lingual features.

## Architecture of Vozinha: A Hybrid ASR and NLU System

Vozinha's pipeline follows a familiar chatbot pattern but with low‑resource adaptations. The audio stream is processed by the fine‑tuned wav2vec 2.0 encoder, which outputs phoneme‑level features. A small Transformer language model (2 layers, 128 hidden units) converts phoneme sequences into word hypotheses using a weighted finite‑state transducer (WFST) built from a 40,000‑word Kriolu lexicon and a 3‑gram language model trained on our text corpus.

For natural language understanding, we opted for a slot‑filling approach rather than end‑to‑end generation. Intent classification (e.g., "weather", "music", "news") uses a bidirectional LSTM with attention over the ASR hypothesis - not the raw audio. This decoupling allows us to swap in a more accurate ASR later without retraining the NLU module. Entity extraction (e.g., city names like Mindelo, Praia, Assomada) uses a dictionary‑based tagger augmented with fuzzy matching to handle dialectal variations.
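A dictionary-based tagger with fuzzy matching can be sketched with Python's standard `difflib` module. This is only an illustration of the idea, not Vozinha's tagger: the three-city gazetteer, the 0.75 similarity cutoff and the sample spellings are assumptions.

```python
import difflib

# Small illustrative gazetteer; the real lexicon is assumed to be much larger.
CITY_GAZETTEER = ["Mindelo", "Praia", "Assomada"]


def tag_cities(hypothesis, gazetteer=CITY_GAZETTEER, cutoff=0.75):
    """Return (token, canonical_city) pairs for tokens that fuzzily match."""
    lowered = {name.lower(): name for name in gazetteer}
    matches = []
    for token in hypothesis.split():
        # get_close_matches ranks candidates by SequenceMatcher ratio
        # and drops anything scoring below the cutoff.
        candidates = difflib.get_close_matches(
            token.lower(), list(lowered), n=1, cutoff=cutoff)
        if candidates:
            matches.append((token, lowered[candidates[0]]))
    return matches


# A variant spelling still resolves to the canonical city name.
print(tag_cities("tempu na Mindelu"))
```

The cutoff trades recall for precision: lowering it catches more dialectal spellings but starts matching unrelated short words.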

We also implemented a fallback mechanism: if the ASR confidence falls below 0.4, Vozinha politely asks "Oh meste repiti" ("Could you repeat that?") instead of guessing. In early beta tests, users appreciated this honesty - engagement time actually increased by 18% compared to a version that always attempted an answer.
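The fallback itself is a small guard in front of the NLU step. The sketch below uses the threshold and prompt quoted above; the `respond` function and the `handle_intent` callback are hypothetical names introduced for illustration.

```python
CONFIDENCE_THRESHOLD = 0.4         # threshold quoted in the text
REPEAT_PROMPT = "Oh meste repiti"  # prompt quoted in the text


def respond(hypothesis, confidence, handle_intent):
    """Ask the user to repeat when ASR confidence is below the threshold.

    `handle_intent` is a hypothetical callback standing in for the NLU
    module; it receives the ASR hypothesis and returns the reply text.
    """
    if confidence < CONFIDENCE_THRESHOLD:
        return REPEAT_PROMPT
    return handle_intent(hypothesis)


# Low-confidence input triggers the prompt instead of a guess.
print(respond("tempu na Praia", 0.31, lambda text: "weather intent"))
print(respond("tempu na Praia", 0.82, lambda text: "weather intent"))
```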

## Training with Noisy Data: Data Augmentation and Transfer Learning

As mentioned, data augmentation was our secret weapon. Beyond standard techniques (speed ±10%, pitch shift ±2 semitones), we introduced "cocktail party" noise by mixing two random clips at varying signal‑to‑noise ratios (SNR 0-20 dB). We also added channel simulation using an FIR filter trained on actual low‑quality mobile phone recordings from Cabo Verde (many users connect via 3G on rural islands).
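The "cocktail party" mixing step comes down to scaling the interfering clip so that the speech-to-interferer power ratio hits a target SNR before adding the two signals. The sketch below uses plain Python lists and synthetic sine waves in place of real audio; the function names and the test signals are assumptions, and only the 0-20 dB range comes from the text.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, interferer, snr_db):
    """Scale `interferer` to the requested SNR relative to `speech`, then add."""
    if not interferer:
        return list(speech)
    # Loop or trim the interferer so both clips have the same length.
    repeated = [interferer[i % len(interferer)] for i in range(len(speech))]
    speech_rms, noise_rms = rms(speech), rms(repeated)
    if noise_rms == 0.0:
        return list(speech)
    # SNR_dB = 20 * log10(speech_rms / noise_rms), solved for the noise level.
    target_noise_rms = speech_rms / (10 ** (snr_db / 20.0))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, repeated)]


# Synthetic stand-ins for two clips sampled at 16 kHz (illustrative only).
random.seed(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
other = [math.sin(2 * math.pi * 330 * t / 16000) for t in range(800)]
snr_db = random.uniform(0.0, 20.0)  # SNR range quoted in the text
mixed = mix_at_snr(speech, other, snr_db)
```

Sampling a fresh SNR for every pair of clips exposes the model to the full range from heavily overlapped to nearly clean speech.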

For transfer learning, we leveraged two models: wav2vec 2.0 XLSR‑53 (a multilingual pre‑trained model covering 53 languages) and a smaller model trained on 1,000 hours of European Portuguese from the Mozilla Common Voice dataset. Surprisingly, XLSR‑53 outperformed the Portuguese‑specific model by 9% WER even though Kriolu wasn't among its 53 pre‑training languages. This suggests that the diverse phonetic space covered by XLSR‑53 provides better generalisation than a single closely related language.

We also experimented with text‑to‑speech back‑translation: generating synthetic audio from our transcribed corpus using a Tacotron 2 model fine‑tuned on Kriolu, then adding the synthetic data to the training mix. While this reduced WER by only 3%, it dramatically improved robustness to unseen speaker voices, as the TTS model introduced novel prosodic patterns not present in the original recordings.

## Evaluating Performance: Word Error Rate and Understanding

We established two evaluation benchmarks: a clean test set (studio‑recorded read speech, 5 speakers) and a noisy test set (field recordings from markets, cafes, and moving vehicles, 20 speakers). On the clean set, our final model achieved a WER of 22.4% - comparable to early English ASR systems in the 2010s. On the noisy set, WER increased to 41.7%, reflecting the heavy influence of background chatter and wind.
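For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, insertions and deletions) between the reference transcript and the hypothesis, divided by the number of reference words. The sketch below is a plain dynamic-programming version that assumes whitespace tokenisation; the `relative_reduction` helper is an illustrative addition for expressing relative improvements between two WER figures.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Relative WER improvement of `new_wer` over `baseline_wer`."""
    return (baseline_wer - new_wer) / baseline_wer


# One substituted word out of four reference words.
print(word_error_rate("n ta papiâ kriolu", "n ta papia kriolu"))  # 0.25
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.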

However, WER alone doesn't capture user satisfaction. We deployed Vozinha in a controlled beta with 50 households across Praia and Mindelo for 3 months. Users asked an average of 12 queries per day, covering weather, radio stations, and music requests. The task completion rate (defined as the user not repeating or rephrasing a request) was 74% for clear audio and 51% for noisy environments. When we introduced the "repeat" fallback, completion increased to 82% on clear audio - users were willing to speak again rather than abandon the request.

One surprising failure mode: proper names. Cabo Verde has many place names with ambiguous spelling (e.g., "São Filipe" vs. "San Filipe") and family names without standard Kriolu orthography. Our dictionary‑based approach missed 23% of named entities. We later switched to phoneme‑based recognition of proper names, which does not require a dictionary entry and cut miss rates to 11%.

## Deployment Challenges on Cabo Verde's Infrastructure

Running Vozinha in Cabo Verde meant confronting infrastructural constraints that Silicon Valley rarely considers. Internet penetration is around 70%, but average download speeds hover around 2 Mbps on the outer islands. Sending full audio to a cloud server for inference was impractical. We adopted an edge‑first architecture: the ASR runs on‑device (a low‑power ARM chip in custom hardware), while NLU and response generation happen on a backend hosted in a Praia data centre with a 10‑second latency SLA.

We used the GGML library to quantise the wav2vec 2.0 model from FP32 to INT8, reducing its size from 300 MB to 75 MB without measurable accuracy loss. The on‑device model runs inference in 0.3x real‑time on a Raspberry Pi 4, consuming less than 5 W. This is critical because many households experience frequent power outages - a cloud‑only solution would be useless during a storm.
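The size reduction follows directly from the bit widths: INT8 stores each weight in 8 bits instead of 32, so a 300 MB model shrinks to roughly a quarter. The sketch below illustrates generic symmetric per-tensor quantisation in plain Python; it does not use the GGML API, and the sample weights and function names are assumptions.

```python
def quantise_int8(weights):
    """Symmetric per-tensor quantisation: floats -> int8 values plus a scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantised = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantised, scale


def dequantise(quantised, scale):
    """Map int8 values back to approximate floats."""
    return [q * scale for q in quantised]


def estimated_size_mb(fp32_size_mb, bits=8):
    """Weight storage after quantisation, ignoring scales and metadata."""
    return fp32_size_mb * bits / 32


weights = [0.5, -1.0, 0.25, 0.0]  # illustrative values only
quantised, scale = quantise_int8(weights)
restored = dequantise(quantised, scale)
print(quantised, [round(r, 3) for r in restored])
print(estimated_size_mb(300))  # 75.0
```

The reconstruction error per weight is bounded by half a quantisation step, which is why accuracy often survives the conversion.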

Network connectivity also forced us to rethink the dialog manager. Rather than a synchronous request‑response cycle, Vozinha buffers multiple queries locally and sends them in batches when the connection is stable. Users receive replies with a slight delay but no dropped interactions. We also implemented a gRPC streaming fallback for Wi‑Fi connections, achieving real‑time interaction in 85% of urban households.
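The buffering behaviour can be sketched as a small queue that only drains while the link is up and keeps a batch if sending fails. The `QueryBuffer` class, the `send_batch` callback and the batch size are hypothetical; how Vozinha detects a stable connection is not covered here.

```python
from collections import deque


class QueryBuffer:
    """Hold queries locally and flush them in batches when the link is up."""

    def __init__(self, send_batch, max_batch_size=8):
        # `send_batch` is a hypothetical callable: it takes a list of
        # queries and returns True when the backend accepted them.
        self._send_batch = send_batch
        self._max_batch_size = max_batch_size
        self._pending = deque()

    def enqueue(self, query):
        self._pending.append(query)

    def flush(self, connection_stable):
        """Send pending queries in order; keep them queued if sending fails."""
        sent = 0
        while connection_stable and self._pending:
            size = min(self._max_batch_size, len(self._pending))
            batch = [self._pending[i] for i in range(size)]
            if not self._send_batch(batch):
                break  # leave the batch queued for the next attempt
            for _ in batch:
                self._pending.popleft()
            sent += len(batch)
        return sent

    def __len__(self):
        return len(self._pending)


# Usage: queries wait while offline and go out in order once the link is back.
delivered = []
buffer = QueryBuffer(lambda batch: delivered.append(batch) or True, max_batch_size=2)
for query in ["tempu na Praia", "toka morna", "notisia"]:
    buffer.enqueue(query)
buffer.flush(connection_stable=False)  # nothing is sent
buffer.flush(connection_stable=True)   # two batches are sent
```

Removing queries only after a successful send is what prevents dropped interactions when the connection fails mid-flush.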

## Open Source Contributions and Future Roadmap

We believe the gap in speech technology for languages like Kriolu isn't a technical impossibility but an economic incentive problem. To lower the barrier for other low‑resource languages, we open‑sourced three components: the Vozinha ASR training pipeline (based on HuggingFace Transformers and Fairseq), the custom WFSTs for Kriolu, and a dataset of 50 isolated‑word recordings per dialect (CC‑BY license). Already, a team in East Timor has adapted our pipeline for Tetun, and a Haitian Creole project has reused our data augmentation recipes.

Our next milestone is to reduce the clean‑set WER below 15% using self‑supervised learning on even larger unlabelled corpora. We're negotiating with the Cabo Verde government to digitise 10,000 hours of historical radio archives - this would give us roughly an eightfold increase over the 1,200 hours of unlabelled audio we have today. With that, we believe we can match the quality of commercial assistants for English within two years.

We are also exploring multimodal input: connecting Vozinha to cameras for lip‑reading as a secondary signal in noisy environments. Preliminary tests on a small dataset (5 speakers, 200 phrases) suggest that adding visual features reduces WER by 18% on our noisiest test set.

## Frequently Asked Questions

  • Is Vozinha available for download? Not yet as a standalone app; we're piloting with select communities in Praia and Mindelo. A public beta is planned for Q4 2025.
  • How much data is needed to replicate this for another low‑resource language? Our experience suggests 300-500 hours of transcribed audio is a minimum baseline, but you can start with as little as
.
