
Comparison of the Methods of Embedding Russian-language Sentences on the Linguistic Test SentEval

Student: Maslennikova Elizaveta

Supervisor: Vladimir Krylov

Faculty: Faculty of Informatics, Mathematics, and Computer Science (HSE Nizhny Novgorod)

Educational Programme: Data Mining (Master)

Year of Graduation: 2020

Nowadays, the development and deployment of artificial intelligence attracts a great deal of attention. An ordinary smartphone user may not even realize how much of the latest technology and how many machine learning algorithms the phone contains. Voice assistants, which make everyday life considerably easier, have recently become very popular. Yet interaction between a person and a device remains an open problem, because the natural language people speak has a very complex structure, whereas a computer can operate only on sets of numbers. Various kinds of embeddings are therefore used to convert texts into a form that a computer can "understand". A large number of such models have been proposed, but most of them rely either on a simple encoding of the letters that make up a word or on an encoding of whole words that does not take their lexical meaning into account. Moreover, such models perform poorly on sets of sentences, paragraphs, or longer texts, where each sentence depends in some way on the previous one and continues its main idea. Considering a single phrase taken out of context is often not enough to understand and analyze the whole text properly. In recent years, researchers have therefore been actively developing models that treat an entire sentence as a single structural unit and combine its semantics, syntax, and the lexical meaning of entire passages. This work studies the effectiveness of various vector representation models for Russian sentences, taking the peculiarities of the language into account.
This area of research is quite new. The scientific literature contains very few studies on vector representations of whole sentences, and most of them target English as a universal international language. Moreover, there are practically no pre-trained sentence embedding models for the Russian language, and no studies that compare the quality of various models on different tasks while taking the features of the language into account. This underlines the scientific novelty and practical significance of this work.

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
