Master's programme
2022/2023
Computer Text Analysis: Advanced Level
Best by the criterion "Usefulness of the course for your future career"
Best by the criterion "Usefulness of the course for broadening your horizons and all-round development"
Best by the criterion "Novelty of the knowledge gained"
Status:
Compulsory course (Data Analysis for Government and Society)
Field of study:
41.04.04. Political Science
Delivered by:
Department of Sociology
Where taught:
St. Petersburg School of Social Sciences
When taught:
2nd year, module 1
Mode of study:
without an online course
Audience:
own campus only
Instructors:
Кольцов Сергей Николаевич
Programme:
Data Analysis for Government and Society
Language:
English
Credits:
3
Contact hours:
24
Course Syllabus
Abstract
This course covers a wide range of machine learning algorithms for textual data analysis. The first part of the course deals with preprocessing of text data, including lemmatizers for different languages and procedures for vectorizing text. It also covers classifiers for textual analysis and the measures used to assess their quality.
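The preprocessing and evaluation steps named above can be illustrated with a minimal sketch in plain Python. It is not taken from the course materials: the function names (`tokenize`, `build_vocabulary`, `vectorize`, `precision_recall_f1`) and the toy sentences are assumptions made for illustration, and lemmatization is deliberately left out because it needs a language-specific library.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of Latin or Cyrillic letters."""
    return re.findall(r"[a-zа-яё]+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Bag-of-words vector: one count per vocabulary entry."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # tokens outside the vocabulary are ignored
            vector[vocabulary[tok]] = n
    return vector


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical toy corpus, used only to show the shape of the output.
docs = ["The film was good", "The film was bad, really bad"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
```

In a real pipeline a lemmatizer would normally run between `tokenize` and `build_vocabulary`, so that inflected forms of one word share a single column.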
The second part of the course focuses on flat and hierarchical topic models and on measures of their quality: coherence, perplexity, log-likelihood, stability, and Renyi entropy. In addition, the course explores the concept of word embeddings for textual analysis, in particular topic modeling.
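Two of the quality measures listed above, log-likelihood and perplexity, are closely related. The sketch below shows the relationship, assuming a fitted topic model is already available as a document-topic matrix `theta` and a topic-word matrix `phi`. These names, the data layout, and the toy numbers are assumptions for illustration, not the course's own scripts.

```python
import math


def log_likelihood(doc_word_counts, theta, phi):
    """Sum over documents d and words w of n_dw * log(sum_k theta[d][k] * phi[k][w]).

    doc_word_counts: list of dicts {word_id: count}, one dict per document.
    theta: theta[d][k] is the probability of topic k in document d.
    phi: phi[k][w] is the probability of word w in topic k.
    Assumes every observed word has non-zero probability under the model.
    """
    total = 0.0
    for d, counts in enumerate(doc_word_counts):
        for w, n in counts.items():
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            total += n * math.log(p)
    return total


def perplexity(doc_word_counts, theta, phi):
    """exp(-log-likelihood / number of tokens); lower is better."""
    n_tokens = sum(sum(counts.values()) for counts in doc_word_counts)
    return math.exp(-log_likelihood(doc_word_counts, theta, phi) / n_tokens)


# Toy check: a model that spreads probability uniformly over a
# two-word vocabulary has perplexity equal to the vocabulary size.
toy_counts = [{0: 2, 1: 2}]
toy_theta = [[1.0, 0.0]]
toy_phi = [[0.5, 0.5], [0.5, 0.5]]
toy_perplexity = perplexity(toy_counts, toy_theta, toy_phi)
```

Perplexity is usually computed on held-out documents, since on the training data it tends to keep improving as the number of topics grows.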
The third part of the course covers neural networks for textual data analysis, built with the TensorFlow framework and its Keras API. Python scripts are provided for all the models discussed. At the end of the course, students present their data analysis work as a presentation with accompanying scripts.
Learning Objectives
- Learn algorithms and their main advantages and limitations in terms of text data analysis
- Obtain skills to work with machine learning software / code
- Be able to work with text data
Expected Learning Outcomes
- Have skills to analyze textual data
- Analyze data with machine learning tools
- Do textual preprocessing (lemmatization and tokenization)
- Present the resulting project in terms of machine learning
- Visualize results of the analysis
Course Contents
- Objectives of text analysis: preprocessing, lemmatization, vectorization.
- Overview of classical classifiers such as KNN, Random Forest, and SVM.
- Bayesian classification for sentiment analysis or topic definition.
- Topic modeling (flat models), quality metrics (coherence, perplexity, log-likelihood, stability, Renyi entropy), review of some libraries.
- Topic modeling (hierarchical models, discussion of problems).
- Embeddings (gensim): what word embeddings are and how to work with them.
- Topic models with embeddings (ETM, GLDAW).
- Introduction to neural networks (TensorFlow, Keras): the basics of working with Keras and an overview of some neural networks.
- Preprocessing of text data for neural networks.
- Working with recurrent neural networks for textual analysis.
- Working with LSTM neural networks for textual analysis.
- Model with multiple outputs (heads).
- Presentation of student work.
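As an illustration of the Bayesian classification topic listed above, the following is a minimal sketch of a multinomial Naive Bayes classifier with Laplace smoothing, written in plain Python. The class name `NaiveBayes`, its parameters, and the toy reviews are assumptions for illustration only, and the input is assumed to be already tokenized.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (1.0 = Laplace smoothing)

    def fit(self, documents, labels):
        """documents: list of token lists; labels: one label per document."""
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {t for c in self.word_counts.values() for t in c}
        return self

    def predict(self, tokens):
        """Return the label with the highest log posterior score."""
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            total = sum(self.word_counts[label].values())
            score = math.log(self.class_counts[label] / n_docs)  # log prior
            for tok in tokens:
                if tok not in self.vocabulary:
                    continue  # skip words never seen in training
                count = self.word_counts[label][tok]
                score += math.log((count + self.alpha)
                                  / (total + self.alpha * vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set for a two-class sentiment task.
train_docs = [["good", "film"], ["great", "film"],
              ["bad", "film"], ["boring", "film"]]
train_labels = ["pos", "pos", "neg", "neg"]
model = NaiveBayes().fit(train_docs, train_labels)
```

Calling `model.predict(["good", "film"])` scores each class by its log prior plus the smoothed log probabilities of the tokens. Smoothing keeps a word that never occurred in one class from driving that class's probability to zero.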
Assessment Elements
- Exam. The exam is a hackathon-style competition to develop the best sentiment analysis model for Russian-language text. At the end of the first part of the course, students receive a Russian-language dataset with sentiment scores and train their classification models on it. A week before the exam, students receive the second part of the dataset, which they must use to test their trained models. At the exam, students give a presentation on their models. The grade depends, first, on the quality of the presentation and, second, on the results obtained (how well the models are trained and how many models were built).
- Homework
Bibliography
Recommended Core Bibliography
- Raschka, S., & Mirjalili, V. (2019). Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2 (3rd ed.). Packt Publishing.
Recommended Additional Bibliography
- Kubat, M. (2017). An Introduction to Machine Learning (2nd ed.). Springer.