ETL for Processing Personal Data of Employees

Student: Kondrashov Artem

Supervisor: Alexander Zubkov

Faculty: Faculty of Computer Science

Educational Programme: Master of Data Science (Master)

Year of Graduation: 2024

Modern times require modern solutions, especially for routine tasks that can be automated. Deep learning (DL) approaches have recently become more popular for such tasks, and as these techniques grow more sophisticated, they can solve more complex problems. Some DL models may even surpass humans at detecting objects, correcting errors in text, recognising sentiment in text, or answering a wide range of questions. Moreover, some models operate faster than the human brain and can handle large amounts of information. In this thesis we present the results of a project dedicated to processing personal data forms. The project is an ETL (Extract, Transform, Load) process that uses several fine-tuned deep learning models to correct spelling and to detect and order the elements of names and addresses, together with conventional rule-based methods to correct date formats. To fine-tune the models, we generated datasets resembling the contents of the form. As a result, we have developed an ETL for processing personal data submitted on MS Excel forms in Russian. The ETL combines rule-based and deep learning approaches to solve formatting and spellchecking tasks.
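The rule-based date correction mentioned in the summary could look roughly like the sketch below. This is an illustration only, not the thesis implementation: the function name `normalize_date`, the list of candidate input formats, and the target format `DD.MM.YYYY` are all assumptions, since the summary does not specify them.

```python
from datetime import datetime
from typing import Optional

# Hypothetical set of input layouts; the summary does not list the real ones.
CANDIDATE_FORMATS = ("%d.%m.%Y", "%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d", "%d.%m.%y")


def normalize_date(raw: str, target: str = "%d.%m.%Y") -> Optional[str]:
    """Try each known layout in order and re-emit the date in the target format.

    Returns None when no layout matches or the date is not a valid calendar
    date, so the caller can flag the cell for manual review.
    """
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime(target)
        except ValueError:
            # Wrong layout or impossible date (e.g. 31 February): try the next rule.
            continue
    return None
```

For example, under these assumed rules `normalize_date("1/2/2024")` and `normalize_date("2024-02-01")` both give `"01.02.2024"`, while `normalize_date("31.02.2024")` gives `None`. Returning `None` rather than guessing keeps invalid dates visible to whoever reviews the form.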

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.