Automated Product Tagging in E-commerce Using Multi-label Classification

Student: Stepan Grigorov

Supervisor: Margarita Burova

Educational Programme: Master of Data Science (Master)

Final Grade: 8

Year of Graduation: 2024

This thesis presents a set of methods that can be applied in industry to predict product categories or tags. The end-to-end process includes collecting datasets, performing exploratory data analysis, preprocessing the datasets, choosing evaluation metrics, and choosing an effective training and evaluation strategy. Dataset collection consisted of finding suitable text-based datasets with hierarchical categories as the target variable. We found 4 datasets that differ in both size and quality, which makes the experiments more broadly applicable. Exploratory data analysis and the subsequent preprocessing are carried out using both built-in Python functionality and third-party libraries such as pandas, numpy, nltk, wordcloud and other well-known libraries used in production. In addition, vectorization methods are chosen to be applied to the preprocessed data before it is passed to the various models. We chose 7 metrics for evaluation: subset accuracy (exact accuracy), adapted accuracy, precision (micro and macro), recall (micro and macro) and Hamming loss. We trained 15 different configurations built on 6 different models, 2 of which are pre-trained neural network models of the BERT family, BERT and DeBERTa, while the other machine learning models are RakelO, ML-KNN, CatBoostClassifier and ClassifierChain. For each of the machine learning models we applied vectorization techniques. The results show that ML-KNN with TF-IDF gives stable, high results on every dataset, while RakelO with TF-IDF or Word2Vec works well on multi-dimensional data. Moreover, a small dataset size might lead to overfitting of neural network models such as BERT or DeBERTa.
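To make the evaluation metrics named above more concrete, the following is a minimal sketch of how subset accuracy, Hamming loss, and micro- and macro-averaged precision and recall can be computed for multi-label predictions. It is not taken from the thesis: the function names, the toy label matrices, and the choice to return 0.0 when a denominator is zero are all assumptions made for illustration. Adapted accuracy is left out because the summary does not define it.

```python
from typing import List, Tuple

# Rows are products, columns are tags; each cell is a 0/1 indicator.
Labels = List[List[int]]


def subset_accuracy(y_true: Labels, y_pred: Labels) -> float:
    """Share of products whose whole tag set is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact / len(y_true)


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Share of individual product-tag decisions that are wrong."""
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


def _safe_div(num: float, den: float) -> float:
    # Assumption for this sketch: an undefined ratio counts as 0.0.
    return num / den if den else 0.0


def _label_counts(y_true: Labels, y_pred: Labels, j: int) -> Tuple[int, int, int]:
    """True positives, false positives and false negatives for tag j."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    return tp, fp, fn


def precision_recall(y_true: Labels, y_pred: Labels, average: str) -> Tuple[float, float]:
    """Return (precision, recall) with 'micro' or 'macro' averaging."""
    counts = [_label_counts(y_true, y_pred, j) for j in range(len(y_true[0]))]
    if average == "micro":
        # Pool the counts over all tags, then compute one ratio.
        tp = sum(c[0] for c in counts)
        fp = sum(c[1] for c in counts)
        fn = sum(c[2] for c in counts)
        return _safe_div(tp, tp + fp), _safe_div(tp, tp + fn)
    if average == "macro":
        # Compute the ratio per tag, then take the unweighted mean.
        precisions = [_safe_div(tp, tp + fp) for tp, fp, _ in counts]
        recalls = [_safe_div(tp, tp + fn) for tp, _, fn in counts]
        return sum(precisions) / len(counts), sum(recalls) / len(counts)
    raise ValueError("average must be 'micro' or 'macro'")


# Toy example (invented): 4 products, 3 tags.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]

print("subset accuracy:", subset_accuracy(y_true, y_pred))
print("Hamming loss:", hamming_loss(y_true, y_pred))
print("micro P/R:", precision_recall(y_true, y_pred, "micro"))
print("macro P/R:", precision_recall(y_true, y_pred, "macro"))
```

On this toy data two of the four products are matched exactly, so subset accuracy is 0.5, while only 2 of the 12 individual tag decisions are wrong. The gap between the two numbers shows why a strict metric and a per-decision metric are usually reported together for multi-label tagging.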

Full text (added June 4, 2024)

