Master
2020/2021
Unstructured Data Analysis
Category 'Best Course for Career Development'
Category 'Best Course for Broadening Horizons and Diversity of Knowledge and Skills'
Category 'Best Course for New Knowledge and Skills'
Type:
Elective course (Applied Statistics with Network Analysis)
Area of studies:
Applied Mathematics and Informatics
Delivered by:
International Laboratory for Applied Network Research
When:
Year 1, modules 1 and 2
Mode of studies:
offline
Instructors:
Ilia Karpov
Master’s programme:
Applied Statistics with Network Analysis
Language:
English
ECTS credits:
8
Contact hours:
48
Course Syllabus
Abstract
This course focuses on applied methods and existing tools for information retrieval: web scraping, data preprocessing, and natural language processing. All methods considered in this course require basic knowledge of discrete mathematics and probability theory. For instance, most NLP and IR methods use conditional probability. In this course, we show how contemporary approaches are implemented in existing software packages (preferably Python frameworks) and demonstrate how these methods can be used to solve real-world problems.
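As a small illustration of the conditional probabilities mentioned above, the following sketch estimates bigram probabilities P(next word | previous word) by counting. The toy corpus and the function name `conditional_probability` are made up for this example and are not part of the course materials.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

context_counts = Counter()  # how often a word appears with a successor
bigram_counts = Counter()   # how often a (previous, next) pair appears

for sentence in corpus:
    tokens = sentence.split()
    # Only tokens that have a successor can serve as a context.
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))


def conditional_probability(next_word, previous_word):
    """Maximum-likelihood estimate of P(next_word | previous_word)."""
    total = context_counts[previous_word]
    if total == 0:
        return 0.0  # unseen context: no estimate without smoothing
    return bigram_counts[(previous_word, next_word)] / total


print(conditional_probability("cat", "the"))
print(conditional_probability("sat", "cat"))
```

Real language models add smoothing so that unseen word pairs do not receive zero probability; this sketch leaves that out for brevity.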
Learning Objectives
- to show how contemporary approaches are implemented in existing software packages (preferably Python frameworks) and to demonstrate how these methods can be used to solve real-world problems
Expected Learning Outcomes
- know the basic principles behind existing deep learning approaches
- know the advantages of existing natural language processing packages
- be able to get necessary data for research and applied projects
- be able to perform basic ETL operations with datasets and unstructured data
- be able to give constructive criticism and identify existing issues in applied NLP tasks
- have an understanding of the basic principles of information retrieval
- have the skill to meaningfully develop an appropriate data analysis pipeline
- have the skill to work with unstructured text data
Course Contents
- IR tasks overview, Python dive-in
  Lecture: The first session will discuss key IR tasks and show simple examples. We will also address several issues with acquiring data from databases, files, and the web. Practical: Getting and serializing data from databases and files.
- Web information extraction
  Lecture: Web scraping techniques and tools. APIs and response formats. Practical: Creating a simple web extraction script.
- Word embeddings
  Lecture: The word ambiguity problem, traditional and contemporary approaches to text representation. Distributional semantics, autoencoder architecture, word2vec, fastText, BERT. The notion of global and local optimization. Practical: Training and fitting word2vec and BERT models, basic text classification.
- Text normalization
  Lecture: The text normalization problem, finite automata, conditional random fields. Practical: Text processing tools for Russian and English.
- Syntax parsing, fact extraction
  Lecture & Practical: Syntax parsing, text augmentation and generation.
- Language modelling, text classification and clustering
  Lecture: Noisy channel model, spellchecking, language modelling, text classification and clustering, cross-validation for evaluating classifiers. Practical: Language modelling, text classification and clustering.
- Sentiment detection
  Lecture: Sentiment detection with dictionaries, CNNs, RNNs. Sentiment detection as a classification problem. Practical: Sentiment classifier development.
- Text visualization methods and interfaces
  Practical: Histograms, multidimensional scaling, word graphs, the highlighting problem.
- Machine translation, question answering
  Lecture: Machine translation with Markov models and recurrent neural networks. Seminar: Seq2seq training, self-attention, Transformer. Analysis of attention heads in the Transformer.
- Summarization and domain adaptation
  Lecture: Transfer learning in text analysis, knowledge distillation. Abstractive summarization and simplification; ROUGE, SARI, BLEU, and METEOR metrics.
- Semantic search and indexing
  Lecture: Elasticsearch queries, morphology parameters, cosine similarity, index density.
- Additional topics and course project defense
  Lecture: Additional topics and course project defense.
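To make one of the topics above concrete, here is a minimal sketch of cosine similarity as used for ranking in semantic search. It uses plain term counts and only the Python standard library; it is not how Elasticsearch scores documents, and the function names and sample documents are assumptions made for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs sorted by descending similarity."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine_similarity(query_vec, Counter(tokenize(doc))), doc)
        for doc in documents
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Sample documents, invented for illustration only.
documents = [
    "Web scraping techniques and tools",
    "Sentiment detection with dictionaries",
    "Machine translation with recurrent neural networks",
]

for score, doc in rank("neural machine translation", documents):
    print(f"{score:.3f}  {doc}")
```

In practice the count vectors would usually be replaced by TF-IDF weights or dense embeddings, but the similarity computation stays the same.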
Assessment Elements
- cumulative mark for the work during the module
  The cumulative mark for the work during the module is based on the marks for homework and on activity during the seminars.
- final exam
  The final exam can be replaced with a course project. The grade for the course project must be set before the final exam.
Interim Assessment
- Interim assessment (module 2)
  0.4 * cumulative mark for the work during the module + 0.6 * final exam
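The weighting above can be checked with a short calculation. In this sketch the function name and the sample marks (8 for the cumulative mark, 6 for the final exam) are hypothetical, and no rounding is applied because the syllabus does not state a rounding rule.

```python
CUMULATIVE_WEIGHT = 0.4  # weight of the cumulative mark, from the formula above
EXAM_WEIGHT = 0.6        # weight of the final exam, from the formula above


def interim_grade(cumulative_mark, final_exam_mark):
    """Weighted interim assessment; the result is left unrounded."""
    return CUMULATIVE_WEIGHT * cumulative_mark + EXAM_WEIGHT * final_exam_mark


# Hypothetical marks: 0.4 * 8 + 0.6 * 6, which is about 6.8.
print(interim_grade(8, 6))
```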
Bibliography
Recommended Core Bibliography
- Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, Mass: The MIT Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=24399
Recommended Additional Bibliography
- Cohen, S. (2016). Bayesian Analysis in Natural Language Processing. Morgan & Claypool Publishers.