Automatic Extraction and Classification of Verb and Nominalization Constructions from Syntactic Corpus of Russian

Student: Mediankin Nikita

Supervisor: Olga Lyashevskaya

Faculty: Faculty of Humanities

Educational Programme: Computational Linguistics (Master)

Final Grade: 8

Year of Graduation: 2016

The main goal of this work was to develop a tool that automatically extracts constructions for a given lemma from a dependency-annotated treebank of Russian texts and gives them a preliminary classification, as well as to present the means of processing such a corpus and to test the quality of the results. The main motivation for developing the tool is its expected usefulness in extending the Russian FrameBank with data from the Russian National Corpus. The work focuses on the extraction of verb constructions and, to a lesser extent, nominal constructions. Semantic frame labeling is outside its scope: the work restricts itself to extracting surface representations, although some proposals for possible developments are made. As a result, a web application with these capabilities was developed, with a flexible interface that lets the user choose from numerous options to refine the results. As of this writing, an alpha version of the web application is available; it can extract constructions from a version of SynTagRus as well as from a part of the reannotated Russian National Corpus. Combined with the RU Syntax command-line tool, it also makes it possible to annotate and add other corpora. The web application can be considered the next step in developing the idea behind RNC Sketches. It was evaluated against 200 random examples from FrameBank, yielding a recall of 80.0% for the extraction of subjects, 84.9% for direct objects, and 81.0% for periphery. Further work may include additional experiments on extracting non-intrinsic subjects based on long-distance dependencies in order to refine the extraction rules. Another follow-up may be the implementation of automatic frame labeling based on individual patterns of the extracted parts’ features and relation labels. Developing a means of detecting non-intrinsic subjects for nominalizations is also of interest.
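
The general idea of extracting a lemma's construction from a dependency parse, and of measuring recall against a gold set, can be illustrated with a minimal sketch. It is not the thesis's implementation: the `Token` record, the relation labels `subj` and `obj`, the `VERB` tag, and the function names are all hypothetical simplifications chosen for illustration, and a real treebank such as SynTagRus has its own tagset and format.

```python
from collections import namedtuple

# Hypothetical, simplified token record (not the actual treebank format).
# `head` is the id of the syntactic parent; 0 marks the root.
Token = namedtuple("Token", "id form lemma pos head deprel")

# Toy parsed sentence: "Мальчик читает книгу" ("The boy reads a book").
# The relation labels here are illustrative assumptions.
EXAMPLE = [
    Token(1, "Мальчик", "мальчик", "NOUN", 2, "subj"),
    Token(2, "читает", "читать", "VERB", 0, "root"),
    Token(3, "книгу", "книга", "NOUN", 2, "obj"),
]


def extract_constructions(sentences, lemma, pos="VERB"):
    """Yield (head, dependents) for each occurrence of `lemma` with tag `pos`."""
    for sentence in sentences:
        for token in sentence:
            if token.lemma == lemma and token.pos == pos:
                # Direct dependents only; long-distance cases are not handled.
                dependents = [t for t in sentence if t.head == token.id]
                yield token, dependents


def surface_pattern(head, dependents):
    """Crude classification key: the head lemma plus its sorted relation labels."""
    return head.lemma, tuple(sorted(t.deprel for t in dependents))


def recall(gold, predicted):
    """Share of gold items that were also extracted."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(predicted)) / len(gold)


if __name__ == "__main__":
    for head, deps in extract_constructions([EXAMPLE], "читать"):
        print(surface_pattern(head, deps), [t.form for t in deps])
```

Grouping occurrences by a key such as `surface_pattern` is one simple way to obtain a preliminary classification; the recall figures reported above would correspond to calling `recall` separately for each argument type.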

Full text (added June 7, 2016)
