Football Matches Outcome Prediction Using Supervised Machine Learning Algorithms

Student: Magomedov Rustam

Educational Programme: Master of Data Science (Master)

Final Grade: 10

Year of Graduation: 2024

Predicting the outcome of football matches remains a challenging task for football enthusiasts and researchers all over the world. The difficulty stems from a multitude of internal factors surrounding the game: team form, players' character and physical shape, training setup, the derby factor, the level of support, and so on. The variability of external conditions cannot be overlooked either: weather, field length and width, and grass quality and height can all affect the outcome of a game. Although companies and agencies collect a large volume of data per game, match outcome prediction remains challenging because only limited data is available to the public. This paper explores the possibilities of applying supervised machine learning algorithms to predict the outcome of a football game. The primary novelty of this work is a custom dataset of more than 1,000 matches and more than 1,600,000 events. The author analyses event data for the English Premier League for seasons 2021-2024. The event data is divided into 3 samples to demonstrate that raw, non-aggregated event data can compete with conventional statistical and ranking-based approaches. The work compares the match outcome prediction accuracy of 4 supervised learning models: logistic regression, XGBoost, random forest, and a graph neural network. The models are trained on the 3 different samples of event data. Each model's accuracy is then compared with that of the other models, as well as with state-of-the-art results. The best accuracy for the multi-class problem, 0.63, is achieved by the logistic regression model on the 10% test sample of aggregated event data. For non-aggregated event data with engineered features, logistic regression reaches an accuracy of 0.585, while random forest and XGBoost reach 0.55 and 0.52 respectively. The graph neural network trained on pass coordinates reaches an accuracy of 0.41 on the multi-class task, while binary classification excluding draws yields an accuracy of 0.53.
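To make the reported metric concrete, the following minimal sketch shows how accuracy could be computed for the three-class task and for the binary task with draws removed. It is an illustration only, not the author's evaluation code: the outcome labels "H", "D", "A" (home win, draw, away win), the helper names, and the sample data are all assumptions, and the abstract does not state exactly how draws were excluded.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of matches whose predicted outcome equals the actual outcome."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one match is required")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def drop_draws(y_true, y_pred, draw_label="D"):
    """Keep only matches whose actual outcome is not a draw (assumed approach)."""
    kept = [(t, p) for t, p in zip(y_true, y_pred) if t != draw_label]
    return [t for t, _ in kept], [p for _, p in kept]


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent outcome."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Made-up outcomes for ten hypothetical matches (not data from the thesis).
actual = ["H", "D", "A", "H", "H", "A", "D", "H", "A", "H"]
predicted = ["H", "H", "A", "H", "A", "A", "H", "H", "H", "H"]

multiclass_acc = accuracy(actual, predicted)
binary_acc = accuracy(*drop_draws(actual, predicted))
baseline_acc = majority_baseline(actual)

print(f"multi-class accuracy: {multiclass_acc:.2f}")
print(f"accuracy without draws: {binary_acc:.2f}")
print(f"majority-class baseline: {baseline_acc:.2f}")
```

A majority-class baseline is included because accuracy on a three-outcome problem is easier to interpret next to the score obtained by always predicting the most common result.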

Full text (added June 3, 2024)

