More SpeechSynthesis for Indigenous Languages

Kendall Moraski
4 min read · Apr 9, 2021

Ideas for extending TTS tech to support ALL languages

In my explorations of speech synthesis technologies, I was surprised (but not really) to find that only a handful of languages seem to be supported at the moment. What perplexed me is that the BCP47 language subtags are, at first glance, all-encompassing, covering every human language from Corsican to Mohegan-Pequot to Tunjung (click the dropdown under “List” and select “Languages (takes a while!)”). I had backtracked my way to the BCP47 search engine from the documentation for the SpeechSynthesisUtterance interface of the Web Speech API, looking to incorporate Text-To-Speech (TTS) functionality. I’m working on a project called LANGBACK where this kind of audio technology could be leveraged to create artificial conversation partners for solo and/or remote learners of Indigenous languages in particular.

To my dismay, the more than 8,200 language subtags in BCP47 appear to exist for documentation purposes only; applying the tag to Wôpanâôt8âôk text in HTML, for example, does not produce an accurate Wôpanâôt8âôk pronunciation in TTS at all. I had to resort to unsightly Anglicization so that the robotic, synthesized voice could sound out a very rough approximation. Through trial and error, I experimented with different ‘spellings’ to get something close to what the word should sound like in the actual language, but I shouldn’t have to stoop to this. Learners of Indigenous languages, especially Indigenous folks themselves who are revitalizing their heritage languages, shouldn’t have to stoop to this. There needs to be a more concerted effort to extend the technology that already exists beyond the mundanity of the so-called “global” languages to the underserved, underrepresented Indigenous languages worldwide.
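To make the failure concrete, here is a minimal TypeScript sketch of how a page can check whether the browser actually ships a voice for a given BCP47 tag. I’m assuming “wam” (Wampanoag) as the relevant primary subtag for Wôpanâôt8âôk, so verify that against the registry; the point is that nothing matches, and the browser quietly falls back to a default voice.

```typescript
// Minimal sketch (browser-only). "wam" is my assumed BCP47 subtag for
// Wôpanâôt8âôk; check the IANA registry for the exact tag you need.
const targetLang = "wam";

function findVoicesFor(lang: string): SpeechSynthesisVoice[] {
  // getVoices() may be empty until the 'voiceschanged' event has fired.
  return window.speechSynthesis
    .getVoices()
    .filter((v) => v.lang.toLowerCase().startsWith(lang.toLowerCase()));
}

window.speechSynthesis.addEventListener("voiceschanged", () => {
  const matches = findVoicesFor(targetLang);
  if (matches.length === 0) {
    // This is the problem described above: no engine ships a voice for the
    // tag, so speech falls back to a default (usually English) voice.
    console.warn(`No installed voice for "${targetLang}"`);
  }
});
```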

Let me back up here for a second. For those unfamiliar, the Web Speech API is made up of two components: SpeechSynthesis and SpeechRecognition. While you might have heard of speech recognition in general (hey Siri), speech synthesis is exactly what it sounds like. See what I did there? The computer produces speech output via techniques such as TTS, where a built-in voice reads aloud whatever text it is given. A large portion of the Web Speech API is used, though not exclusively, for accessibility purposes, giving all users alternative ways of comprehending text.
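For a sense of how little code TTS takes once a language actually is supported, here is a minimal sketch; the text, tag, and settings are placeholders.

```typescript
// Minimal browser TTS with the Web Speech API.
const utterance = new SpeechSynthesisUtterance("Hello, world");
utterance.lang = "en-US";   // BCP47 tag; the engine picks a matching voice
utterance.rate = 1.0;       // speaking rate, 1 is the default
utterance.pitch = 1.0;      // pitch, 1 is the default
window.speechSynthesis.speak(utterance); // read the text out loud
```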

International Phonetic Alphabet 2020 Full Chart

The IBM Watson TTS service goes so far as to allow IPA phonemes to be entered literally in code, a feat in its own right, for more accurate pronunciation, and it lets the developer write those phonetic symbols in Unicode. However, upon further digging, I found that this is only intended to cover regional pronunciation variants of words within more widely spoken languages like English, such as the different regional pronunciations of “tomato”. Another example comes from Spokestack, which does support both Speech Synthesis Markup Language (SSML) and Speech Markdown syntax. Their page is quick to point out that their focus is on English, and therefore the IPA characters they do support are only for “loan words.” Whew, we’ve still got a long way to go with this. As my colleague Aiden Pine stated, “There is no existing service that lets you use arbitrary IPA symbols, nor are there any services for TTS for the vast majority of Indigenous languages.” Period.
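To give a sense of what that Watson feature looks like, here is a sketch of the SSML phoneme element with IPA, wrapped in TypeScript; the two “tomato” transcriptions are my own rough ones, not anything from the service’s documentation.

```typescript
// Sketch of SSML's <phoneme> element with IPA, as accepted by services like
// IBM Watson TTS. The IPA strings below are my own rough transcriptions.
const ssml = `
<speak>
  You say
  <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>,
  I say
  <phoneme alphabet="ipa" ph="təˈmɑːtəʊ">tomato</phoneme>.
</speak>`;
// This string would go in the body of a synthesize request to the service.
// The catch, as noted above, is that the engine only honours IPA for the
// handful of languages it already ships.
```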

What’s wild is the depths to which some speech synthesis technologies can already dive to mimic things like prosody, that is, stress/rhythm and tone/intonation in speech. The Google Cloud Text-to-Speech reference will give anyone with a background in linguistics flashbacks to Phonology 300 courses, but it’s frustrating that, unlike typical linguistics case studies, the sample data is all English! The sad reality is that there is truly a dearth of speech synthesis AND speech recognition services for Indigenous languages. This is a problem because it reinforces and perpetuates the persistent devaluing of Indigenous languages and their place in the modern world. As has been widely reported for decades since the language revitalization movement took hold, when a language loses all its speakers, all the knowledge systems encoded in that language are lost as well. This is a detriment to humanity. The accessibility of rapidly advancing technology, coupled with globalization, has contributed to this linguicide, but this same technology can be harnessed to reverse the course.
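As a rough illustration of that prosodic depth, here is the kind of SSML markup the Google Cloud reference describes, again wrapped in TypeScript; the values are illustrative rather than tuned, and every example of this sort assumes a language the service already supports.

```typescript
// Sketch of prosody control via SSML, of the kind Google Cloud
// Text-to-Speech documents. Values are illustrative placeholders.
const prosodicSsml = `
<speak>
  <prosody rate="slow" pitch="-2st">This clause is slower and lower.</prosody>
  <break time="300ms"/>
  <emphasis level="strong">This one is stressed.</emphasis>
</speak>`;
// The markup is expressive, but the voices it drives are, as noted above,
// drawn from the same short list of widely spoken languages.
```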

Some positive work has been done to start shifting the trajectory, including speech synthesizers for Mohawk and Plains Cree, but as confirmed through my conversations with Aiden, there is still a lot of work to be done, including hours upon hours of data that need to be recorded in an ethical and responsible manner in order to properly build these speech synthesizers for Indigenous languages. Even a ‘non-threatened’ language like Swahili, which is represented in Google Translate TTS, still sounds more than a little artificial and potentially inaccurate. You can test it out yourself here. Seriously, try pressing the speaker button after inputting a word to translate; it’s barely intelligible. If a language with upwards of 100 million speakers cannot get accurate speech synthesis representation on Google, of all places, what chance do languages on the brink of extinction stand? The capability is extant and the phonemes have been documented; now the speech synthesis technology needs to catch up and seize the opportunity to have a hand (or a voice) in Indigenous language revitalization.
