Application of Anomalous Clustering Methods for Determination of the Number of Clusters

Student: Rykov Andrey

Educational Programme: Business Informatics (Bachelor)

Year of Graduation: 2021

Clustering is a set of data analysis techniques aimed at finding groups of similar objects in data, called clusters. K-Means is the most popular clustering algorithm. However, it has some limitations: any run of K-Means requires the number of clusters, as well as the initial cluster centers, to be specified in advance by the user. The goal of this work is to explore the potential of two anomalous clustering methods, Anomalous Patterns and Big Anomalous Clusters One-by-one, as tools for finding the optimal input parameters for K-Means. Our objectives are: reviewing popular methods for detecting the optimal number of clusters; adapting anomalous clustering algorithms for this purpose; implementing all the reviewed algorithms and statistics in Python; conducting computational experiments on various datasets; and analyzing the results. In this project, we used synthetic data with Gaussian clusters and seven datasets from the UC Irvine Machine Learning Repository. We examined an analytic version of the Elbow method (the Curvature method, Elbow Level 1) and proposed a different procedure for this method, Elbow Level 2. The Hartigan Rule of Thumb, Calinski-Harabasz, and Silhouette Width-based methods were also investigated. The performance evaluation metrics are the Adjusted Rand Index, Normalized Mutual Information, and the Mean Absolute Relative Error at the extracted number of clusters (MARE); the latter is proposed in this work. The results of the computational experiments show that the most accurate popular methods are Elbow Level 2 and Silhouette Width. Anomalous Patterns is the more accurate and time-efficient of the two anomalous clustering methods under consideration. Anomalous Patterns is less accurate than Elbow Level 2 and Silhouette Width on data with small numbers of features, but the difference in performance is insignificant on data with greater numbers of features.
In addition, the Anomalous Patterns method outperforms the popular methods in execution time, being 100 times faster on all the datasets. This algorithm appears very promising. Therefore, it is desirable to explore different methods for center initialization and for filtering the extracted clusters during the execution of the anomalous clustering methods.
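To illustrate the general idea of using anomalous clustering to choose K-Means inputs, the following is a minimal Python sketch of one common formulation of Anomalous Patterns. It is not the thesis implementation. It assumes the grand mean as the fixed reference point, squared Euclidean distance, and a simple size threshold for filtering small clusters. The names `anomalous_patterns`, `min_size` and `max_iter`, and the toy data, are hypothetical choices made for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def anomalous_patterns(data, min_size=2, max_iter=100):
    """Extract clusters one by one; return the centers of the kept clusters.

    Assumptions of this sketch: the reference point is the grand mean of all
    data, and clusters with fewer than `min_size` points are discarded.
    """
    reference = mean(data)
    remaining = list(range(len(data)))  # indices of points not yet clustered
    centers = []
    while remaining:
        # Seed the cluster with the remaining point farthest from the reference.
        seed = max(remaining, key=lambda i: sq_dist(data[i], reference))
        center = list(data[seed])
        members = list(remaining)
        for _ in range(max_iter):
            # A point joins the cluster if it is closer to the center
            # than to the reference point.
            closer = [i for i in remaining
                      if sq_dist(data[i], center) < sq_dist(data[i], reference)]
            if not closer:
                # No point is strictly closer to the center, e.g. all remaining
                # points coincide with the reference: take them all and stop.
                members = list(remaining)
                break
            members = closer
            new_center = mean([data[i] for i in members])
            if new_center == center:
                break  # the center stopped moving
            center = new_center
        member_set = set(members)
        remaining = [i for i in remaining if i not in member_set]
        if len(members) >= min_size:  # assumed filtering rule
            centers.append(center)
    return centers


# Toy data: three tight 3x3 grids of points plus one isolated outlier.
offsets = (-0.5, 0.0, 0.5)
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [(cx + dx, cy + dy) for cx, cy in blobs for dx in offsets for dy in offsets]
data.append((30.0, 30.0))

centers = anomalous_patterns(data)
print(len(centers), centers)
```

The number of returned centers can serve as the number of clusters, and the centers themselves as the initial centers for K-Means. On this toy data the outlier forms a singleton cluster that the assumed size threshold discards, which leaves one center per grid.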

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
