The Method of Maximum Likelihood for Lifting a Fuzzy Leaf Set in a Domain Taxonomy

Student: Ayrapetyan Zhirayr

Supervisor: Boris Mirkin

Faculty: Faculty of Computer Science

Educational Programme: Data Science (Master)

Year of Graduation: 2021

This work updates the Parsimonious Generalization method for fuzzy sets on the leaves of a taxonomy, proposed by the research group of B. G. Mirkin. It analyzes queries represented as fuzzy sets of topics assigned to the leaves of a rooted taxonomy tree. In this work, the terms query, fuzzy set, and cluster on the leaves of the taxonomy tree are used synonymously. To generalize a fuzzy set on the leaves of a taxonomy, the set is lifted up the tree to find the main topics that explain it.

The algorithms are applied to the structural analysis and description of a collection of 17 thousand abstracts of research papers published in 17 Springer journals on Data Science. The Data Science taxonomy used here is a six-level hierarchy from the manually developed Association for Computing Machinery Computing Classification System (ACM CCS), supplemented by members of B. G. Mirkin's research group to reflect current trends in the field. Annotated Suffix Trees (AST) are used to compute the relevance matrix between the text collection and the keywords of the taxonomy leaves.

In the earlier work of B. G. Mirkin's research group, fuzzy sets of taxonomy leaves were extracted from the relevance matrix with FADDIS, a specially developed spectral clustering method that is also described in this work. Here FADDIS is replaced by a combination of two algorithms: first, the leaf vectors are projected onto the first several eigenvectors of the normalized Laplacian matrix, those corresponding to the smallest eigenvalues (Laplacian Eigenmaps); then the resulting embeddings are clustered with the Fuzzy C-means method. This substitution made it possible to extract and generalize twice as many interpretable clusters.

The generalization method has also been changed. The heuristic Parsimonious Generalization is replaced by a probabilistic lifting method based on maximum likelihood (Maximum Likelihood Generalization), which eliminates the model's hyperparameters and the need to tune them. Six extracted clusters were generalized to the following topics: information retrieval methods, clustering methods, machine learning, modeling methods, Markov networks and network representation models, and reinforcement learning. The software needed to implement these algorithms has been developed, including a program for graphical output of the generalization results.
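The two-step cluster extraction described above (Laplacian Eigenmaps followed by Fuzzy C-means) can be illustrated with a minimal NumPy sketch. This is not the thesis code: the function names, the toy similarity matrix, the fuzzifier value of 2, and the choice to drop the trivial first eigenvector are all assumptions made for illustration only.

```python
import numpy as np


def laplacian_eigenmaps(similarity, n_components):
    """Embed items with the eigenvectors of the normalized Laplacian
    that correspond to its smallest eigenvalues."""
    w = np.asarray(similarity, dtype=float)
    degree = w.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    laplacian = np.eye(len(w)) - inv_sqrt[:, None] * w * inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order
    _, eigenvectors = np.linalg.eigh(laplacian)
    # Assumption: skip the trivial eigenvector (eigenvalue near 0)
    return eigenvectors[:, 1:n_components + 1]


def fuzzy_c_means(points, n_clusters, fuzzifier=2.0, n_iter=100, seed=0):
    """Plain Fuzzy C-means; returns (membership matrix, cluster centers)."""
    rng = np.random.default_rng(seed)
    membership = rng.random((len(points), n_clusters))
    membership /= membership.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        weights = membership ** fuzzifier
        centers = (weights.T @ points) / weights.sum(axis=0)[:, None]
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # avoid division by zero
        inv = dist ** (-2.0 / (fuzzifier - 1.0))
        membership = inv / inv.sum(axis=1, keepdims=True)
    return membership, centers


def demo():
    """Toy example: six hypothetical leaves forming two similarity blocks."""
    similarity = np.kron(np.eye(2), np.ones((3, 3))) + 0.01
    embedding = laplacian_eigenmaps(similarity, n_components=1)
    membership, _ = fuzzy_c_means(embedding, n_clusters=2)
    return membership


if __name__ == "__main__":
    print(np.round(demo(), 3))
```

Each row of the returned membership matrix is a fuzzy assignment of one leaf to the clusters, so each column can be read as a fuzzy set on the taxonomy leaves of the kind that the lifting step then generalizes.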

