Introduction to Machine Learning and Data Mining

Master 2020/2021

Type: Elective course (Data Science)

Area of studies: Applied Mathematics and Informatics

Delivered by: School of Data Analysis and Artificial Intelligence

Where: Faculty of Computer Science

When: Year 1, modules 3 and 4

Mode of studies: offline

Instructors: Dmitry I. Ignatov

Master’s programme: Data Science

Language: English

ECTS credits: 5

Contact hours: 36


Abstract

The course “Machine Learning and Data Mining” introduces students to the new and actively evolving interdisciplinary field of modern data analysis. Having started as a branch of Artificial Intelligence, it has attracted the attention of physicists, computer scientists, economists, computational biologists, linguists and others, and has become a truly interdisciplinary field of study. Despite the variety of data sources that can be analyzed, the objects and attributes that form a particular dataset possess common statistical and structural properties. The interplay between known and unknown data gives rise to complex pattern structures and machine learning methods, which are the focus of the course. The course covers methods of Machine Learning and Data Mining. Special attention is given to hands-on practical analysis of real-world datasets using available software tools and modern programming languages and libraries.

Learning Objectives

To familiarize students with the new and rapidly evolving field of machine learning and data mining, and to provide practical experience in the analysis of real-world data.

Expected Learning Outcomes

Students know the basic notions and terminology used in machine learning and data mining (MLDM).
Students understand fundamental principles of modern data analysis.
Students develop mathematical models of MLDM.
Students analyze real-world data.

Course Contents

Introduction to Machine Learning and Data Mining
Introduction to modern data analysis. Machine Learning. Data Mining and Knowledge Discovery in Databases. Course structure. Basic tasks and examples.
Clustering and its basic techniques
The task of clustering. K-means and its modifications (k-medoids and fuzzy c-means clustering). Density-based methods: DBSCAN and Mean Shift. Hierarchical clustering. Quality criteria.
Classification and its basic techniques
The task of classification. 1-Rules. K-Nearest Neighbours approach. Naïve Bayes. Decision Trees. Logistic Regression. Quality assessment: precision, recall, F-measure, loss function, confusion matrix, cross-validation and learning curves (ROC, lift, etc.). Multi-class and multi-label classification.
Frequent Itemset Mining and Association Rules
Frequent itemsets. Apriori and FP-growth algorithms. Association rules. Interestingness measures: support and confidence. Closed itemsets. Connection with Lattice Theory and Formal Concept Analysis. Applications.
Feature Selection and Dimensionality Reduction. Outlier detection
Feature selection versus feature extraction and generation. Singular Value Decomposition, Latent Semantic Analysis and Principal Component Analysis. Boolean Matrix Factorization. Outlier and novelty detection techniques.
Recommender Systems and Algorithms
Collaborative filtering. User-based and item-based methods. Slope One. Association-rule-based and bicluster-based techniques. Quality assessment: MAE, precision and recall. SVD-based approaches: pureSVD, SVD++ and time-SVD. Factorization machines.
Ensemble Clustering and Classification
Ensemble clustering methods for aggregating k-means partitions. Ensemble methods of classification: Bagging, Boosting, and Random Forest.
Multimodal relational clustering
Biclustering. Spectral co-clustering. Triclustering. Two-mode networks. Folksonomies and resource-sharing systems. Multimodal approaches. Applications: community detection in Social Network Analysis and gene expression analysis.
Artificial Neural Methods and Stochastic Optimization. Elements of Statistical Learning
Artificial Neural Networks. Basic ideas of Deep Learning. (Stochastic) gradient descent. Statistical (Bayesian) view of Machine Learning.
Machine Learning Tools and Big Data
Orange, Weka, and Scikit-learn. Machine Learning for Big Data: Apache Spark.

Assessment Elements

Homework
Research project
Exam
The final exam consists of an oral project defense, during which the student may be asked to answer theoretical or practical questions. The grade is calculated by a formula that takes the accumulated grade into account. The exam is held remotely in oral form (project defense) on the Zoom platform. Students must join the exam according to the defense schedule, which the instructor sends to the students' group email the day before the exam. Additional theoretical questions may be asked, or small practical tasks that do not require programming may be assigned.

Interim Assessment

Interim assessment (module 4)
0.2 * Exam + 0.4 * Homework + 0.4 * Research project
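
As an illustration of how the weights above combine, here is a minimal Python sketch. The function name, the argument names and the sample scores are hypothetical; the syllabus does not state the grading scale or any rounding rule, so the sketch returns the raw weighted sum without rounding.

```python
# Weights copied from the interim assessment formula in this syllabus.
WEIGHTS = {"exam": 0.2, "homework": 0.4, "research_project": 0.4}


def interim_grade(exam, homework, research_project):
    """Return the weighted sum of the three assessment elements.

    No rounding is applied: the rounding rule is not given in the syllabus.
    """
    scores = {
        "exam": exam,
        "homework": homework,
        "research_project": research_project,
    }
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, chosen only to show the arithmetic:
# 0.2 * 8 + 0.4 * 6 + 0.4 * 10 = 1.6 + 2.4 + 4.0
example = interim_grade(exam=8, homework=6, research_project=10)
print(f"{example:.2f}")  # prints 8.00
```

Because homework and the research project each carry twice the weight of the exam, together they determine 80% of the interim grade.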

Bibliography

Recommended Core Bibliography

Han, J., Kamber, M., Pei, J. Data Mining: Concepts and Techniques, Third Edition. – Morgan Kaufmann Publishers, 2011. – 740 pp.

Recommended Additional Bibliography

Hall, M., Witten, I. H., Frank, E. Data Mining: Practical Machine Learning Tools and Techniques. – 2011. – 664 pp.
