Why do we need multilingual speech recognition?

Speech recognition technologies are present in our everyday lives more than we are aware of, whether in the form of voice assistants such as Siri or Alexa, or as automatic subtitles on YouTube or Netflix. Less visible in our day-to-day but of great importance to society, these technologies also play a crucial role in making cultural heritage accessible. It would be impossible to find a specific radio or television program among the hundreds of thousands of hours that broadcast archives hold, or to recover a testimony lost in an interview collection, were it not for automatic transcription and search engines. Subtitles are also key to making audiovisual content accessible for people with hearing impairments.

Even though speech-to-text has made incredible advances over the past years and its quality and performance keep improving, the field still has an outstanding debt: supporting real-life multilinguality. By this I mean the change or combination of languages within the same fragment of speech. This is no rare use case but rather the way in which large sectors of the population actually speak, namely those who migrated or grew up in homes or neighborhoods where more than one language was spoken. Just walking around most European cities gives an idea of how people blend words from different languages into the local one, or switch languages depending on who they are speaking with. Not to mention the wide variety of accents, dialects and sociolects that exist. Taking the case of Germany, where aureka is based, multilinguality is a close reality for at least a third of the population, namely those with a migration history or background (reference population with migration background). Linguists have shown how migration can enrich a language, introducing new variations and meanings (link).

… (to be continued)