
Automatic POS-tagging for Chinese Using Parallel Data

Student: Konovalova Aleksandra

Supervisor: Olga Lyashevskaya

Faculty: Faculty of Humanities

Educational Programme: Fundamental and Computational Linguistics (Bachelor)

Final Grade: 9

Year of Graduation: 2021

The aim of this study is to compare approaches to part-of-speech (POS) annotation and to build a fine-tuned model for the Chinese texts of the Russian-Chinese Parallel Corpus of the Russian National Corpus, using parallel data. Automatic POS tagging is a fundamental task in natural language processing. Chinese poses a number of challenges for classical methods because it has little morphology and no spaces or other delimiters between words. State-of-the-art algorithms perform POS tagging and word segmentation jointly. Another approach, usually applied to low-resource languages, uses parallel data to improve labelling. Although POS tagging for Chinese is relatively well solved, texts translated from Russian contain a number of loanwords, which may cause problems for traditional models. Therefore, in this work we apply existing algorithms to such data and fine-tune the models using the tagging of the parallel Russian texts. For evaluation, we manually annotated test data, carried out a comparative analysis of how part-of-speech tags are used in different Chinese tagging standards, and built a neural network based on parallel Russian-Chinese data.
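
As a rough illustration of how parallel data can inform labelling, the sketch below projects POS tags from aligned Russian tokens onto Chinese tokens by majority vote. It is a generic annotation-projection example, not the method used in the thesis: the function name `project_tags`, the Universal POS style tags, the fallback tag `X`, and the toy sentence pair with its alignment are all assumptions made for this example.

```python
from collections import Counter


def project_tags(src_tags, alignment, tgt_len, default="X"):
    """Project POS tags from source tokens onto target tokens.

    src_tags:  POS tags of the source (here: Russian) tokens
    alignment: iterable of (source_index, target_index) pairs
    tgt_len:   number of target (here: Chinese) tokens
    default:   tag given to target tokens that have no aligned source token
    """
    # One vote counter per target token.
    votes = [Counter() for _ in range(tgt_len)]
    for src_idx, tgt_idx in alignment:
        votes[tgt_idx][src_tags[src_idx]] += 1
    # Take the most frequent projected tag, or the default if unaligned.
    return [v.most_common(1)[0][0] if v else default for v in votes]


# Hypothetical toy pair: "Я читаю книгу" / "我 读 书" with a one-to-one alignment.
ru_tags = ["PRON", "VERB", "NOUN"]
zh_tokens = ["我", "读", "书"]
alignment = [(0, 0), (1, 1), (2, 2)]

projected = project_tags(ru_tags, alignment, len(zh_tokens))
for token, tag in zip(zh_tokens, projected):
    print(token, tag)
```

In practice the alignment would come from an automatic word aligner and would be noisy, so projected tags of this kind are usually treated as weak supervision for fine-tuning rather than as final labels.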

Full text (added May 24, 2021)

Student theses at HSE must be completed in accordance with the University's rules and the regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
