Development of a Speech Annotation Tool Using the Example of Kazym Northern Khanty Language

Student: Apollinariya Karpova

Educational Programme: Fundamental and Computational Linguistics (Bachelor)

Year of Graduation: 2024

Despite significant advances in speech processing technologies such as Automatic Speech Recognition (ASR), diarization, and Spoken Language Identification (SLID) (e.g., Park et al., 2022; Li, 2022), many low-resource languages remain underrepresented. This project develops several speech processing tools for Kazym Khanty, a low-resource Finno-Ugric language spoken by approximately 9,500 people in western Siberia, according to the 2010 census (Russian Federal State Statistics Service, 2010). We compiled two novel datasets for this project. The first, used for diarization, consists of 26 audio recordings totaling 1,273 minutes (roughly 21 hours), annotated for speaker turns, start and end times of speech fragments, and the language spoken. The second, used for ASR, comprises 5 hours of annotated speech, including monologues, dialogues, tales, stories, and news segments. The datasets were sourced from fieldwork conducted between 2018 and 2024 and from online publications in the Khanty language. Using the first dataset, we fine-tuned the pyannote.audio diarization model, reducing its Diarization Error Rate (DER) from 38.9% to 26.3%. This improvement will help researchers transcribe elicitation sessions with Khanty speakers more efficiently, and it may also benefit others working with low-resource languages and similar interaction styles in audio recordings. Using the second dataset, we fine-tuned the Whisper multilingual small model for speech recognition in Khanty, achieving a Word Error Rate (WER) of 44.57%. An ASR system developed specifically for Khanty has the potential to significantly aid researchers in collecting data for text corpora and in transcribing elicitation sessions and monologues by Khanty speakers. Even at this moderate quality, with a WER that remains high, the model can make the transcription process more efficient.
Additionally, we explored integrating a SLID model into our diarization system to help researchers focus solely on the Khanty segments requiring transcription and translation. However, after achieving only 53% accuracy with the SpeechBrain SLID model, we concluded that additional fine-tuning is needed before it can be effectively used in conjunction with the diarization model.
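
For readers unfamiliar with the WER figure reported above, the following is a minimal sketch of how a word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The function name and the example sentences are illustrative assumptions. The abstract does not state which tool or text normalization was used to obtain the 44.57% figure, so this should not be read as the procedure used in the thesis.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenization is plain whitespace splitting, with no
    case folding or punctuation stripping (an assumption of this sketch).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder English tokens, not Khanty data.
    print(word_error_rate("the river is wide today", "the river was wide"))
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference. A value near 45% means that, on average, close to one in two reference words needs correcting, which is consistent with the abstract's framing of the model as an aid to transcription rather than a replacement for it.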

