
Program for Bot Detection Based on the Analysis of Areas of the Same Intrinsic Dimension in Language Manifolds

Student: Borodin Nikita

Supervisor: Vasilii Gromov

Faculty: Faculty of Computer Science

Educational Programme: Software Engineering (Bachelor)

Year of Graduation: 2024

This work introduces a new research object: the fractal structure of a language. We hypothesize that the vector embeddings of all n-grams of a natural language form a representative sample of such an object (we use the term Hailonakea for the totality of a language's fractal structures across all n). The study estimates the intrinsic (true) dimensions of the fractal structures of Russian and English, and of regions within them, with the aim of developing an algorithm that distinguishes human-written texts from bot-generated texts in each language. To this end, we use methods based on (1) topological data analysis, (2) construction of a minimum spanning tree over the point cloud under consideration (Steele's theorem), and (3) hierarchical graph clustering (Zahn's and Gath-Geva's algorithms). For both languages, the dimensions for all n are non-integer, which is typical of fractal sets, and are close to 9. Comparing human-written and bot-generated texts in the same language shows that bot texts have a lower dimension than human texts. A prototype detector (a binary human/bot classifier) was developed during the work, achieving an accuracy of 64% (F1-score: 0.68) for Russian and 79% (F1-score: 0.79) for English. The metrics are reported for models with the mGPT and YaLM architectures.
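To illustrate method (2), the following is a minimal sketch of a minimum-spanning-tree dimension estimate, not the thesis's actual implementation. It relies on the scaling from Steele's theorem: for n points sampled in a d-dimensional set (d ≥ 2), the total Euclidean MST length grows roughly as n^((d-1)/d), so the slope s of log(length) against log(n) gives d ≈ 1 / (1 - s). The function names `mst_length` and `estimate_dimension`, the subsample sizes, and the synthetic 2-D point cloud are all assumptions made for illustration; the thesis applies such estimates to n-gram embeddings.

```python
import math
import random


def mst_length(points):
    """Total edge length of the Euclidean MST, via Prim's algorithm (O(n^2))."""
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge from the tree to each vertex
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Pick the vertex outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total


def estimate_dimension(points, sizes, rng):
    """Fit log(MST length) against log(sample size); return d = 1 / (1 - slope)."""
    xs, ys = [], []
    for m in sizes:
        sample = rng.sample(points, m)  # random subsample of m points
        xs.append(math.log(m))
        ys.append(math.log(mst_length(sample)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    if slope >= 1.0:
        raise ValueError("slope >= 1: the scaling assumption does not hold")
    return 1.0 / (1.0 - slope)


# Hypothetical demo: points drawn uniformly from the unit square (true dimension 2).
rng = random.Random(0)
cloud = [(rng.random(), rng.random()) for _ in range(800)]
d_hat = estimate_dimension(cloud, sizes=[100, 200, 400, 800], rng=rng)
print(f"estimated dimension: {d_hat:.2f}")
```

Because the estimate comes from a fitted slope rather than a count of coordinates, it is generally non-integer, which is consistent with the non-integer dimensions reported above. On small samples the estimate is noisy and biased by boundary effects, so it should be read as a rough indicator only.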

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
