Master's programme
2020/2021
Modern Methods of Data Analysis
Status:
Compulsory course (Data Science)
Field of study:
01.04.02. Applied Mathematics and Informatics
Delivered at:
Faculty of Computer Science
When:
1st year, modules 1 and 2
Mode of study:
without an online course
Programme:
Data Science
Language:
English
Credits:
4
Contact hours:
56
Course Syllabus
Abstract
This is a course in basic methods for modern Data Analysis. Its contents are heavily influenced by the idea that data analysis should help in enhancing and augmenting knowledge of the domain as represented by the concepts and statements of relation between them. This view distinguishes the subject from related courses such as applied statistics, machine learning, data mining, etc. Two main pathways for data analysis are: (1) summarization, for developing and augmenting concepts, and (2) correlation, for enhancing and establishing relations. Visualization, in this context, is a way of presenting results in a cognitively comfortable way. The term summarization is understood quite broadly here to embrace not only simple summaries like totals and means, but also more complex summaries: the principal components of a set of features and cluster structures in a set of entities. Similarly, correlation here covers both bivariate and multivariate relations between input and target features including classification trees and Bayes classifiers. Another feature of the class is that its main thrust is to give an in-depth understanding of a few basic techniques rather than to cover a broad spectrum of approaches developed so far. Most of the described methods fall under the same least-squares paradigm for mapping an “idealized” structure to the data. This allows me to bring forward a number of mathematically derived relations between methods that are usually overlooked.
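As a minimal illustration of the least-squares paradigm mentioned above, the sketch below fits a rank-one "idealized" structure to a centered data matrix and checks that the data scatter splits into an explained part and a squared residual. It is not taken from the course materials: the random data and all variable names are assumptions made for the example, and the rank-one model is obtained with NumPy's SVD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 entities by 4 features, centered column-wise.
Y = rng.normal(size=(50, 4))
Y = Y - Y.mean(axis=0)

# Rank-one least-squares model: Y is approximated by sigma * z c^T,
# where z and c are the first left and right singular vectors.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
z, sigma, c = U[:, 0], s[0], Vt[0]
residual = Y - sigma * np.outer(z, c)

# Decomposition of the data scatter into explained and unexplained parts.
total_scatter = np.sum(Y ** 2)
explained = sigma ** 2
unexplained = np.sum(residual ** 2)

print(total_scatter, explained + unexplained)
print("share of scatter explained:", explained / total_scatter)
```

The two printed totals should agree up to rounding, which is the sense in which the first principal component is a least-squares summary of the data.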
Learning Objectives
- To give students basic competence in the modern English language and style used in technical discussions of data analysis and data mining problems on the international scene.
- To provide a unified framework and system for capturing the mainstream of numerous data analysis approaches and methods developed so far.
- To teach modern methods of data analysis, including cutting-edge techniques such as intelligent clustering, spectral clustering, consensus clustering, community detection, SVD and principal component analysis, bootstrapping for validation and comparison of averages, and evolutionary optimization.
- To give hands-on experience in real-world data analysis.
- To provide experience in using modern computational tools.
Expected Learning Outcomes
- Students know methods and their theoretical underpinnings for matrix and probabilistic data models.
- Students know methods and their theoretical underpinnings for K-Means clustering, including rules for its initialization and interpretation.
- Students know methods and their theoretical underpinnings for comparing cluster means with computational validation techniques such as bootstrapping.
- Students know methods and their theoretical underpinnings for interpreting clusters in nominal scales, Quetelet indexes and Pearson’s Chi-squared.
- Students know methods and their theoretical underpinnings for clustering similarity and network data including community detection, spectral clustering, and consensus clustering.
- Students know methods and their theoretical underpinnings for mixed scales data, quantification, pre-processing, standardization.
- Students know methods and their theoretical underpinnings for hierarchical clustering.
- Students know methods and their theoretical underpinnings for principal component analysis (PCA), SVD and data visualization.
- Students know methods and their theoretical underpinnings for matrices of covariance and correlation indexes; conventional formulation for PCA.
- Students know methods and their theoretical underpinnings for spectral clustering.
- Students know methods and their theoretical underpinnings for correlation and determinacy indexes from different perspectives.
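To make the outcome on Quetelet indexes and Pearson's chi-squared more concrete, here is a small sketch, assuming a made-up cluster-by-category contingency table. It computes the relative Quetelet index for each cell and checks that Pearson's chi-squared equals N times the average Quetelet index. The table values and variable names are hypothetical and are not taken from the course.

```python
import numpy as np

# Hypothetical contingency table: rows are clusters, columns are categories.
counts = np.array([[30, 10, 5],
                   [10, 25, 20]], dtype=float)
N = counts.sum()

p = counts / N                        # joint relative frequencies
p_row = p.sum(axis=1, keepdims=True)  # cluster (row) marginals
p_col = p.sum(axis=0, keepdims=True)  # category (column) marginals

# Quetelet index: relative change in the frequency of a category
# when the cluster is known, compared with its overall frequency.
quetelet = p / (p_row * p_col) - 1.0

# Pearson chi-squared computed from observed and expected frequencies.
expected = p_row * p_col
chi2 = N * np.sum((p - expected) ** 2 / expected)

# The same value as N times the average Quetelet index,
# where each cell is weighted by its joint frequency.
chi2_via_quetelet = N * np.sum(p * quetelet)

print(np.round(quetelet, 3))
print(chi2, chi2_via_quetelet)
```

A positive index in a cell means the category is over-represented in that cluster, which is how the index can be read as an interpretation aid.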
Course Contents
- Intro: course contents and administration
- Data table. Feature modeling. Feature as mapping. Probability feature model. Categorical data: probability and frequency. Conditional probability; independence; Bayes theorem. Continuous distribution and density function. Mean and variance. Random sample. Distribution of the sample mean. Central limit theorem. Quantitative coding for mixed scales. Elements of matrix theory: linear subspaces and principal directions. Least-squares approximation. Full system of events: nominal feature. Bivariate distribution and contingency table. Popular distributions (Gaussian, power law, Poisson, Bernoulli). Chi-squared distribution. Distribution of the sample variance.
- K-Means clustering: method and properties. Clustering criterion and its reformulations. K-Means clustering as alternating minimization; nature-inspired algorithms for K-Means; Partitioning Around Medoids (PAM); choosing the number of clusters; initialization of K-Means; Anomalous pattern and Intelligent K-Means.
- Cluster interpretation: comparison of means, bootstrap for confidence intervals. Cluster interpretation aids.
- Cluster interpretation at categorical features, Pearson chi-squared, Quetelet indexes
- Principal component analysis (PCA), Singular value decomposition (SVD), using PCA for data visualization
- PCA: covariance and correlation matrices, meaning and properties of correlation coefficient in three perspectives; conventional formulation of PCA
- Clustering similarity and network data; K-Means converted criterion and algorithms
- Consensus clustering; two criteria; reduction to network clustering
- Spectral clustering
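The view of K-Means as alternating minimization listed among the topics above can be sketched as follows. This is a generic batch K-Means written for illustration only, not the course's Intelligent K-Means or anomalous pattern initialization: the function name `k_means`, the synthetic two-blob data, and the choice of initial centers are all assumptions made for the example.

```python
import numpy as np

def k_means(Y, centers, max_iter=100):
    """Alternate cluster and center updates until assignments stop changing."""
    centers = np.asarray(centers, dtype=float).copy()
    labels = None
    history = []  # value of the least-squares criterion after each cluster update
    for _ in range(max_iter):
        # Cluster update: assign each entity to its nearest center.
        dist2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist2.argmin(axis=1)
        history.append(dist2[np.arange(len(Y)), new_labels].sum())
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Center update: move each center to the mean of its cluster.
        for k in range(len(centers)):
            members = Y[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return labels, centers, history

# Hypothetical data: two well-separated blobs of 30 points each.
rng = np.random.default_rng(1)
Y = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
               rng.normal(5.0, 0.3, size=(30, 2))])

# Initial centers are taken as one entity from each blob (an assumption).
labels, centers, history = k_means(Y, centers=Y[[0, 30]])
print(labels)
print(history)
```

Each of the two updates can only lower the summed squared distance to the centers or leave it unchanged, so the values in `history` never increase. The result still depends on the initial centers, which is why initialization is treated as a separate topic.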
Bibliography
Recommended Core Bibliography
- Mirkin, B. Core concepts in data analysis: summarization, correlation and visualization. – Springer Science & Business Media, 2011. – 388 pp.
Recommended Additional Bibliography
- Grünwald, P. D. The minimum description length principle. – MIT Press, 2007. – 736 pp.
- Hall, M., Witten, Ian H., Frank, E. Data Mining: practical machine learning tools and techniques. – 2011. – 664 pp.
- Han, J., Kamber, M., Pei, J. Data Mining: Concepts and Techniques, Third Edition. – Morgan Kaufmann Publishers, 2011. – 740 pp.
- Larose, D. T., Larose, C. D. Discovering knowledge in data: an introduction to data mining. – John Wiley & Sons, 2014. – 336 pp.
- Mazza, R. Introduction to information visualization. – Springer, 2009. – 139 pp.
- Schölkopf, B., Smola, A. J. Learning with kernels: support vector machines, regularization, optimization, and beyond. – MIT Press, 2001. – 648 pp.
- Webb, A. R. Statistical pattern recognition. – John Wiley & Sons, 2011. – 668 pp.
- Witten, I. H. et al. Data Mining: Practical machine learning tools and techniques. – Morgan Kaufmann, 2017. – 654 pp.