ASR for Scottish Gaelic

Protecting endangered languages through technology

From voice assistants to automatic subtitles, Automatic Speech Recognition (ASR) is part of our everyday lives. It has improved access to information and made it easier to interact with computers across a range of tasks. For researchers at the University of Edinburgh's School of Literatures, Languages and Cultures, ASR could also help protect, document and revitalise endangered languages such as Scottish Gaelic.

As explained by Evans et al. (2022), ASR systems commonly require three components: an acoustic model (AM) trained on transcribed speech data, a language model (LM) trained on text data, and a pronunciation lexicon, which acts as the intermediary between the two models. For minority languages, such data is often scarce or missing entirely. Augmentation techniques are therefore needed to improve the quality and quantity of the data, followed by data preparation and alignment of the audio with its transcripts. After overcoming these difficulties, the research team, led by Professor Lamb, used the open-source Kaldi speech recognition toolkit to build several Gaelic ASR systems.

To train the neural network that underlies the AM and LM, the researchers used their free allocation on the University’s HPC cluster, Eddie. This spared the project the considerable cost of commercial services. Waiting times were reasonable, and there was no need to explore exclusive or priority access routes.

The results were promising: the system reached a word error rate (WER) of 26.30%. WER is a standard evaluation metric that measures how much the transcription output by the ASR system differs from a reference transcription. The system has since been used for subtitling online Gaelic media and has recently been made available as a web app, in collaboration with the University of Bangor. Performance is expected to improve further as more training data becomes available.
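To make the metric concrete, here is a minimal Python sketch of how WER is typically computed: the word-level edit distance (substitutions, deletions and insertions) between the reference and the ASR output, divided by the number of words in the reference. This is an illustration only, not the scoring procedure used in the project; the function name `word_error_rate` and the example sentences are invented for this sketch.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name); assumes whitespace-separated
    words and no text normalisation such as lowercasing or punctuation removal.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# Invented example: one substitution ("sat" -> "sit") and one deletion ("the")
# against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.2%}")
```

Read this way, a WER of 26.30% corresponds to roughly 26 word-level errors for every 100 words in the reference transcription. Because insertions count as errors, WER can exceed 100% on very poor output.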

Prof Lamb, Personal Chair in Gaelic Ethnology and Linguistics in the section of Celtic and Scottish Studies and an EFI Research Affiliate, was the guest speaker at the 2022 Lunchtime Seminar Series. You can watch the recording of his presentation on the Digital Research Media Hopper Channel (EASE log-in required).


This research was funded by the Soillse Research Fund, DDI/SFC's BEACON Build Back Better Open Call and the Scottish Government.

Publications

  • Lucy Evans, William Lamb, Mark Sinclair, and Beatrice Alex. 2022. Developing Automatic Speech Recognition for Scottish Gaelic. In Proceedings of the 4th Celtic Language Technology Workshop within LREC2022, pages 110–120, Marseille, France. European Language Resources Association.
