Authorship Attribution of Source Code Based on Programmers' Coding Style

Student: Bogomolov Egor

Supervisor:

Faculty: School of Computer Science, Physics and Technology

Educational Programme: Applied Mathematics and Information Science (Bachelor)

Final Grade: 10

Year of Graduation: 2019

Source code authorship attribution is an important problem for resolving plagiarism and copyright issues in software development. Modern solutions for authorship identification rely on features specific to a particular language and cannot be easily applied to another one. In addition, existing works have not been tested on large amounts of data because no appropriate dataset was available. In this work, we present a tool for mining the history of project changes that can be used to collect datasets for authorship identification from projects with multiple authors. Considering the change history rather than a static snapshot yields significantly more data from a single project. Using this tool, we collect 7 datasets from the IntelliJ IDEA project to test different aspects of authorship detection models. We also design two models that achieve state-of-the-art accuracy in authorship attribution for Java, Python and C++ code. The models do not depend on the features of a particular language and can be applied to any programming language without modification.

Keywords: source code authorship identification, stylometry, abstract syntax tree, machine learning, path-based representations, random forest.
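To illustrate what a path-based representation of an abstract syntax tree can look like, here is a minimal sketch. It is not the thesis implementation: it assumes Python 3.8+ and uses Python's built-in `ast` module for convenience, and the function names `leaf_paths` and `path_contexts` and the `max_length` parameter are hypothetical. Each pair of leaves is described by the chain of node types connecting them through their lowest common ancestor.

```python
import ast
from collections import Counter
from itertools import combinations


def leaf_paths(tree):
    """Collect (token, [nodes from root to leaf]) for identifier and literal leaves."""
    leaves = []

    def visit(node, trail):
        trail = trail + [node]
        if isinstance(node, ast.Name):
            leaves.append((node.id, trail))
        elif isinstance(node, ast.Constant):
            leaves.append((repr(node.value), trail))
        for child in ast.iter_child_nodes(node):
            visit(child, trail)

    visit(tree, [])
    return leaves


def path_contexts(source, max_length=8):
    """Count (start token, node-type path, end token) triples for all leaf pairs.

    Hypothetical helper: max_length is an illustrative cut-off on path size.
    """
    leaves = leaf_paths(ast.parse(source))
    contexts = Counter()
    for (tok_a, trail_a), (tok_b, trail_b) in combinations(leaves, 2):
        # The shared prefix of both root-to-leaf trails ends at the
        # lowest common ancestor of the two leaves.
        shared = 0
        limit = min(len(trail_a), len(trail_b))
        while shared < limit and trail_a[shared] is trail_b[shared]:
            shared += 1
        up = [type(n).__name__ for n in reversed(trail_a[shared:])]
        top = type(trail_a[shared - 1]).__name__
        down = [type(n).__name__ for n in trail_b[shared:]]
        path = up + [top] + down
        if len(path) <= max_length:
            contexts[(tok_a, "|".join(path), tok_b)] += 1
    return contexts


if __name__ == "__main__":
    for context, count in path_contexts("x = y + 1").items():
        print(context, count)
```

Counts like these, aggregated per code fragment, could serve as a feature vector for a classifier such as the random forest named in the keywords. A language-independent pipeline would need a parser for each target language in place of Python's `ast`.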

Full text (added May 26, 2019)

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
