Master
2020/2021
Introduction to Machine Learning and Data Mining
Type:
Elective course (Data Science)
Area of studies:
Applied Mathematics and Informatics
Delivered by:
School of Data Analysis and Artificial Intelligence
Where:
Faculty of Computer Science
When:
1st year, modules 3 and 4
Mode of studies:
offline
Instructors:
Dmitry I. Ignatov
Master’s programme:
Data Science
Language:
English
ECTS credits:
5
Contact hours:
36
Course Syllabus
Abstract
The course “Machine Learning and Data Mining” introduces students to a new and actively evolving interdisciplinary field of modern data analysis. Having started as a branch of Artificial Intelligence, it has attracted the attention of physicists, computer scientists, economists, computational biologists, linguists and others, and has become a truly interdisciplinary field of study. Despite the variety of data sources that can be analyzed, the objects and attributes that form a particular dataset possess common statistical and structural properties. The interplay between known and unknown data gives rise to complex pattern structures and machine learning methods, which are the focus of the course. The course covers methods of Machine Learning and Data Mining, with special attention to hands-on analysis of real-world datasets using available software tools and modern programming languages and libraries.
Learning Objectives
- To familiarize students with the new, rapidly evolving field of machine learning and data mining, and to provide practical experience in the analysis of real-world data.
Expected Learning Outcomes
- Students know basic notions and terminology used in machine learning and data mining (MLDM).
- Students understand fundamental principles of modern data analysis.
- Students develop mathematical models of MLDM.
- Students analyze real-world data.
Course Contents
- Introduction to Machine Learning and Data Mining
  Introduction to modern data analysis. Machine Learning. Data Mining and Knowledge Discovery in Databases. Course structure. Basic tasks and examples.
- Clustering and its basic techniques
  The task of clustering. K-means and its modifications (k-medoids and fuzzy c-means clustering). Density-based methods: DBSCAN and Mean Shift. Hierarchical clustering. Quality criteria.
- Classification and its basic techniques
  The task of classification. 1-Rules. K-Nearest Neighbours approach. Naïve Bayes. Decision Trees. Logistic Regression. Quality assessment: precision, recall, F-measure, loss function, confusion matrix, cross-validation and learning curves (ROC, lift, etc.). Multi-class and multi-label classification.
- Frequent Itemset Mining and Association Rules
  Frequent itemsets. Apriori and FP-growth algorithms. Association rules. Interestingness measures: support and confidence. Closed itemsets. Connection with Lattice Theory and Formal Concept Analysis. Applications.
- Feature Selection and Dimensionality Reduction. Outlier Detection
  Feature selection versus feature extraction and generation. Singular Value Decomposition, Latent Semantic Analysis and Principal Component Analysis. Boolean Matrix Factorization. Outlier and novelty detection techniques.
- Recommender Systems and Algorithms
  Collaborative filtering. User-based and item-based methods. Slope One. Association-rule-based and bicluster-based techniques. Quality assessment: MAE, precision and recall. SVD-based approaches: PureSVD, SVD++ and timeSVD. Factorization machines.
- Ensemble Clustering and Classification
  Ensemble clustering methods for aggregating k-means partitions. Ensemble methods of classification: Bagging, Boosting, and Random Forest.
- Multimodal Relational Clustering
  Biclustering. Spectral co-clustering. Triclustering. Two-mode networks. Folksonomies and resource-sharing systems. Multimodal approaches. Applications: community detection in Social Network Analysis and gene expression analysis.
- Artificial Neural Methods and Stochastic Optimization. Elements of Statistical Learning
  Artificial Neural Networks. Basic ideas of Deep Learning. (Stochastic) gradient descent. Statistical (Bayesian) view of Machine Learning.
- Machine Learning Tools and Big Data
  Orange, Weka, and Scikit-learn. Machine Learning for Big Data: Apache Spark.
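As a small illustration of the quality-assessment measures named in the classification topic above (precision, recall, F-measure), the following is a minimal sketch in plain Python. It is not part of the official course materials; the function name `precision_recall_f1` and the toy label vectors are hypothetical and chosen only for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for a single positive class.

    Hypothetical helper for illustration; labels equal to `positive`
    are treated as the class of interest, everything else as negative.
    """
    pairs = list(zip(y_true, y_pred))
    # Entries of the binary confusion matrix that the measures depend on.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data (assumed): 3 true positives, 1 false positive, 1 false negative.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 1, 1, 0, 0]
    print(precision_recall_f1(y_true, y_pred))
```

In practice, a library listed in the course, such as Scikit-learn, provides equivalent metrics, so a hand-written version like this mainly serves to make the definitions concrete.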
Assessment Elements
- Homework
- Research project
- Exam
  The final exam consists of an oral project defense, during which a student may be asked to answer some theoretical or practical questions. The grade is calculated using a formula that takes the accumulated grade into account. The exam is held remotely in oral form (project defense) on the Zoom platform. Students must join the exam according to the defense schedule that the instructor sends to the students' group e-mail address on the day before the exam. Additional theoretical questions may be asked, or small practical tasks that do not require programming may be assigned.
Bibliography
Recommended Core Bibliography
- Han, J., Kamber, M., Pei, J. Data Mining: Concepts and Techniques, Third Edition. – Morgan Kaufmann Publishers, 2011. – 740 pp.
Recommended Additional Bibliography
- Hall, M., Witten, I. H., Frank, E. Data Mining: Practical Machine Learning Tools and Techniques. – 2011. – 664 pp.