
Japanese Written Language Compositionality in Word Embeddings

Student: Zakoyan Robert

Supervisor: Andrey Borisovich Kutuzov

Faculty: Faculty of Humanities

Educational Programme: Language Theory and Computational Linguistics (Master)

Year of Graduation: 2017

Extracting the semantic similarity of lexical items in the form of word embeddings is a robust and well-performing method, often described as a state-of-the-art algorithm for statistically driven representation of synonymy relations. The underlying mechanism relies on context comparison: two words are estimated to be more semantically related if they tend to appear in similar contexts. Word embedding models were introduced quite recently, and their implementations vary greatly, which leaves a lot of room for experimentation and adjustment. Traditionally, these models treat each unique word in a processed corpus separately and independently (i.e., as a unique token) and do not consider possible relations between word meanings that follow from word form. While this approach keeps the algorithm language-independent, a lot of valuable data that could contribute to synonymy detection is lost. Because word embeddings essentially form a vector space, it is possible to perform algebraic operations on the tokens' vectors, such as "replacing" a certain aspect of a word's meaning by means of subtraction and addition. Languages with hieroglyphic writing, however, offer a unique insight, because their words are composed of distinct elements, each bearing its own meaning or multiple meanings. Consequently, in most cases word meanings can be hypothesized to be "compositional", that is, treated as the sum of the meanings of the hieroglyphs they consist of. Utilizing this information may drastically improve the performance of word embeddings for such languages. This paper attempts to discover whether available word embedding models are capable of capturing this semantic compositionality and, if so, to propose a way to better adapt the word embedding mechanism to training on hieroglyphic writing, taking into account the specific information encoded in .
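
The compositionality hypothesis in the abstract can be illustrated with a minimal sketch: sum the vectors of the characters a word consists of and compare the result with the word's own vector by cosine similarity. This is not the method used in the thesis. The three-dimensional vectors, the example word and the helper names `compose` and `cosine` are hypothetical and chosen purely for illustration; real vectors would come from a trained embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def compose(word, char_vectors):
    """Sum the vectors of all characters in `word`.

    Returns None if any character has no vector, since the
    composed vector would then be incomplete.
    """
    if not word or any(ch not in char_vectors for ch in word):
        return None
    dims = len(char_vectors[word[0]])
    total = [0.0] * dims
    for ch in word:
        total = [t + c for t, c in zip(total, char_vectors[ch])]
    return total

# Hypothetical toy vectors (assumption: not taken from any real model).
char_vectors = {
    "火": [1.0, 0.0, 0.0],  # "fire"
    "山": [0.0, 1.0, 0.0],  # "mountain"
    "水": [0.0, 0.0, 1.0],  # "water"
}
word_vectors = {
    "火山": [0.9, 0.8, 0.1],  # "volcano", written as fire + mountain
}

composed = compose("火山", char_vectors)
score = cosine(composed, word_vectors["火山"])
baseline = cosine(char_vectors["水"], word_vectors["火山"])
print(f"composed vs. word: {score:.3f}, unrelated character vs. word: {baseline:.3f}")
```

If a model captures compositionality, the composed vector should on average be closer to the word's own vector than vectors of unrelated characters are. A plain sum is only one possible composition function; averaging or weighting the character vectors would be equally valid choices for such an experiment.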

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
