2024/2025
Basic Methods of Data Analysis
Status:
Mago-Lego
Delivered in:
Modules 1, 2
Audience:
own campus
Instructors:
Boris Grigorievich Mirkin
Language:
English
Credits:
6
Contact hours:
54
Course Syllabus
Abstract
Data analysis aims to help the user enhance and augment knowledge of a domain, as represented by concepts and statements of relation between them. This view distinguishes this class from related subjects such as applied statistics, machine learning, and data mining. The two main pathways for knowledge discovery are: (1) summarization, for developing and augmenting concepts, and (2) correlation, for enhancing and establishing relations between concepts. The term summarization is understood quite broadly here: it embraces not only simple summaries such as totals and means, but also more complex ones, namely the principal components of a set of features and cluster structures in a set of entities. Similarly, correlation here covers both bivariate and multivariate relations between input and target features, including regression, classification trees, and Bayesian classifiers. Another feature of the class is that its main thrust is an in-depth presentation of a few basic techniques and their properties, rather than coverage of the broad spectrum of approaches developed so far. This allows me to bring forward a number of mathematically derived interpretation tools and relations between methods that are usually overlooked.
Learning Objectives
- Learning modern methods of data analysis, including cutting-edge techniques such as intelligent clustering, spectral clustering, consensus clustering, community detection, SVD and principal component analysis, bootstrapping for validation and comparison of averages, and evolutionary optimization techniques
- Gaining hands-on experience in real-world data analysis
Expected Learning Outcomes
- Data types (numerical, categorical); quantification of categorical data
- Methods for linear regression and discrimination
- Methods for principal component analysis, both model-based and conventional, related issues in singular value decomposition and spectral analysis
- K-means clustering together with its intelligent versions for determining the number of clusters and their initial location, as well as rules for interpretation of clusters
- Assessment of associations in contingency tables by using operationally interpretable Quetelet indexes and Pearson’s chi-squared
- Similarity data clustering including spectral clustering and consensus clustering
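As an illustration of the contingency-table outcome above, here is a minimal sketch of the relation between Quetelet indexes and Pearson's chi-squared. It assumes the relative Quetelet index is defined as q(l|k) = P(l|k)/P(l) - 1, a definition the syllabus does not spell out. The function name `quetelet_and_chi2` and the 2x2 table of counts are hypothetical and serve only as an example.

```python
def quetelet_and_chi2(table):
    """Return (q, chi2, chi2_via_q) for a contingency table of counts.

    q[k][l] is the relative Quetelet index, ASSUMED here to be
    P(l|k) / P(l) - 1 = N_kl * N / (N_k+ * N_+l) - 1.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    rows, cols = range(len(row_tot)), range(len(col_tot))

    # Relative change in the frequency of column l once row k is known.
    q = [[table[k][l] * n / (row_tot[k] * col_tot[l]) - 1.0 for l in cols]
         for k in rows]

    # Pearson's chi-squared in its usual (observed - expected)^2 / expected form.
    chi2 = 0.0
    for k in rows:
        for l in cols:
            expected = row_tot[k] * col_tot[l] / n
            chi2 += (table[k][l] - expected) ** 2 / expected

    # The same quantity written as the count-weighted sum of Quetelet indexes.
    chi2_via_q = sum(table[k][l] * q[k][l] for k in rows for l in cols)
    return q, chi2, chi2_via_q


# Hypothetical 2x2 table of counts, made up for illustration only.
example_table = [[20, 10],
                 [10, 60]]
q, chi2, chi2_via_q = quetelet_and_chi2(example_table)
print(q)
print(chi2, chi2_via_q)
```

Under this definition the two chi-squared values coincide, which is what makes the coefficient operationally interpretable: it is the total, over all observations, of the relative frequency changes q(l|k).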
Course Contents
- Intro
- Data table
- Probability feature model
- Linear correlation structures
- Bayesian approach to data analysis
- Neural networks
- Principal component analysis
- K-Means clustering
- Cluster interpretation
- Hierarchical clustering
- Clustering similarity and network data
- Individual home project
Assessment Elements
- Home Project: The goal of the home project is to teach students the practical use of the data analysis methods under study, applied to a real-world data table taken from the internet or any other data source. The data should include at least 100 entities over at least 7 features and must be approved by the instructor. A project may be carried out by a team of two or three individuals, if approved by the instructor. A project consists of a number of tasks, each to be carried out by the team after the corresponding method has been explained in a lecture. A usual set of tasks includes: Principal Component Analysis (20%), Linear regression and effects of false correlation (15%), Validation and comparison of means by bootstrapping (15%), Application of k-means clustering (20%), and Forming and analyzing a contingency table (20%), altogether 90%. The remaining 10% is for the quality of the report, written in Word or any other popular editing program (Adobe or LaTeX, for example). There are two deadlines set by the instructor. The first deadline is for teams to form and present their dataset for the instructor's approval; it is set within 4-6 weeks after the start of the class. Teams missing this deadline will have their mark for the home project reduced by 1 grade (out of 10). The second deadline is for submission of the home project report; it falls in the final month of the class, usually two or three weeks before the exam date. The mark of those submitting late but within two weeks of the deadline is reduced by 20%. The mark of those submitting even later is reduced by 50%.
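The task weights and late-submission penalties described above can be sketched as a small calculation. This is only an illustration under stated assumptions: the 20% and 50% reductions are treated as relative reductions of the earned mark, lateness is measured in days (with "two weeks" taken as 14 days), and the 1-grade deduction for a missed dataset deadline and the conversion to the 10-grade scale are left out. The names `TASK_WEIGHTS` and `project_percent` are hypothetical.

```python
# Task weights as listed in the syllabus, expressed as fractions of the project mark.
TASK_WEIGHTS = {
    "pca": 0.20,
    "linear_regression": 0.15,
    "bootstrap": 0.15,
    "k_means": 0.20,
    "contingency_table": 0.20,
    "report_quality": 0.10,
}


def project_percent(task_scores, days_late=0):
    """Return the project mark in percent.

    task_scores maps a task name to a score in [0, 1]; missing tasks count as 0.
    ASSUMPTION: late penalties scale the earned mark (x0.8 or x0.5); the
    syllabus does not say whether the reduction is relative or absolute.
    """
    base = 100.0 * sum(weight * task_scores.get(task, 0.0)
                       for task, weight in TASK_WEIGHTS.items())
    if days_late <= 0:
        factor = 1.0   # submitted on time
    elif days_late <= 14:
        factor = 0.8   # late, but within two weeks: reduced by 20%
    else:
        factor = 0.5   # later than two weeks: reduced by 50%
    return base * factor


# Hypothetical team with full marks on every task.
full_marks = {task: 1.0 for task in TASK_WEIGHTS}
on_time = project_percent(full_marks)
one_week_late = project_percent(full_marks, days_late=7)
three_weeks_late = project_percent(full_marks, days_late=21)
print(on_time, one_week_late, three_weeks_late)
```

If the reductions are instead meant as absolute percentage points, only the final multiplication would need to change.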
- Exam: The exam paper is an in-class test of approximately 80-100 minutes. It includes about 6 or 7 questions, two of which are theoretical and four or five of which are, so to speak, "practical" (examples of "practical" questions are given in the next section). One more question relates to a task in the individual home project. To make cheating more difficult, the paper comes in 6-7 versions that differ in parameter settings. Each question is assigned a maximum mark; the maximum marks total 100. In marking, each answer is evaluated according to how well it covers the related material, within the maximum mark assigned to the question. The same rules apply to those who retake the exam. The sum of the question marks is the exam mark, in percent, which is then rounded to the conventional 10-grade scale. Rounding follows fairness criteria: for example, 52 and 53 are rounded to 5; 58 and 59, to 6. The rounding of other marks takes into account the student's discipline, including presence at or absence from lectures and seminars. A student who missed more than half of the sessions should not be surprised if their 56% mark is rounded down to 5. Some exam questions may involve simple calculations that can be done with an electronic calculator. No more complex devices are permitted in the exam: neither notebook computers, nor smartphones, nor tablets. Exact numerical solutions are not necessary for a successful answer.
Bibliography
Recommended Core Bibliography
- Mirkin, B. (2019). Core Data Analysis: Summarization, Correlation, and Visualization.
Recommended Additional Bibliography
- Grus, J. (2019). Data Science From Scratch: First Principles with Python (2nd ed.). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=2102311