Data and Service Engineering for Business Process Automation

Master's programme 2020/2021

Status: Elective course (Data Science)

Field of study: 01.04.02 Applied Mathematics and Informatics

Delivered by: Department of Data Analysis and Artificial Intelligence

Delivered at: Faculty of Computer Science

When: 2nd year, modules 1 and 2

Instruction format: without an online course

Instructors: Alexey Andreevich Neznanov, Andrey Andreevich Parinov

Programme: Data Science

Language: English

Credits: 8

Contact hours: 56

Abstract

Machine learning is changing the world rapidly and dramatically, and every modern enterprise now sees it as one of the top instruments for improving business KPIs. Yet behind any successful application of machine learning is a large amount of engineering work, including data engineering functions such as data cleaning, wrangling, and integration. The models must also be deployed in production as reliable services, and advanced analytics is needed to understand how the service is operating. In this course you will learn the basics of these engineering and analytic disciplines. We won’t focus on machine learning algorithms in this course; knowledge of them is a prerequisite.

Learning Objectives

To gain basic proficiency in data engineering and to understand the key concepts, technologies, and challenges of this subject area.

Expected Learning Outcomes

Understanding of course content.
Understanding of relational model, SQL, its power and its limitations.
Understanding of non-relational databases: when they should be used and what their strengths and weaknesses are.
Understanding of different Enterprise Architectures for real-time online businesses and the trade-offs of using each type of architecture.
Understanding of basic reliability and durability mechanisms used in database and streaming systems.
Understanding of query processing and optimisation in relational systems, ability to reason about and optimise query plans.
Understanding of Big Data technologies, including the Hadoop and Spark stacks and massively parallel DBMSs.
Basic understanding of problems in data integration and data cleaning, familiarity with ETL processes and data warehouses.
Understanding of the key aspects of reliability of ML services and the key technologies for building a reliable machine learning service.
Understanding of advanced anomaly detection and collective learning techniques and their applications in building machine learning services.

Course Contents

Introduction
Here we’ll learn why it’s hard to train a machine learning model, quickly put it into production, and move on to another project. What extra problems crop up during this process? What extra risks appear when the model is transferred to production mode? We’ll give an overview of general decision systems based on Data Science. We’ll also dive into a specific business scenario that will be the guiding example in our course: an online credit business. We will go over the business model, the major KPIs, the constraints the business places on possible machine learning solutions, and some fundamental problems.
Relational Data Model and Databases
Data in modern businesses comes in a variety of types, from basic textual and numeric data to geographical data, images, videos, time series, etc. We will go over the basic data types and show how they are best used in machine learning tasks. Then we’ll go into detail on relational data models.
Non-relational Databases
We’ll go into detail on non-relational data models.
Event-based data models. Kappa and Lambda architectures. Process mining.
A typical business can be described as a set of business processes, and the event-based data model captures all the important events generated by these processes. A log of such events is at the core of modern real-time architectures such as Lambda and Kappa. We’ll study how to recover all the needed data from the event log and how to test hypotheses on top of such a log. We’ll create usable data marts on top of event logs for analytics. We’ll study advanced analytics techniques such as process mining and cohort analysis.
Durability and Reliability of Databases and Streaming Systems
We'll learn the basic reliability and durability mechanisms used in database and streaming systems.
Query Processing in Relational Systems
We'll learn query processing and optimisation in relational systems, including how to reason about and optimise query plans.
Big Data
Big Data technologies, including the Hadoop and Spark stacks and massively parallel DBMSs.
Data Integration and cleaning
What are the typical problems with data quality? How can we increase data quality? The data integration problem: semantic data integration and virtual data integration.
Building a reliable ML service
Key aspects of reliability of ML services and key technologies for building a reliable machine learning service.
Anomaly detection and collective learning
Advanced anomaly detection and collective learning techniques and their applications in building machine learning services.
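As an informal illustration of the event-based data model described above, the following minimal Python sketch shows how state can be recovered by replaying an event log and how a small data mart can be derived from the same log. It is not taken from the course materials: the `Event` class, the event kinds `loan_issued` and `payment_received`, and the sample log are all hypothetical, loosely modelled on the online credit scenario.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    day: date
    kind: str      # hypothetical event kinds, not defined by the course
    loan_id: str
    amount: float


def replay(events):
    """Rebuild the outstanding balance per loan by folding over the log."""
    balances = defaultdict(float)
    for event in sorted(events, key=lambda e: e.day):
        if event.kind == "loan_issued":
            balances[event.loan_id] += event.amount
        elif event.kind == "payment_received":
            balances[event.loan_id] -= event.amount
    return dict(balances)


def issued_by_cohort(events):
    """A tiny data mart: total amount issued per (year, month) cohort."""
    mart = defaultdict(float)
    for event in events:
        if event.kind == "loan_issued":
            mart[(event.day.year, event.day.month)] += event.amount
    return dict(mart)


# Made-up sample log for illustration only.
LOG = [
    Event(date(2020, 9, 1), "loan_issued", "L1", 1000.0),
    Event(date(2020, 9, 15), "payment_received", "L1", 400.0),
    Event(date(2020, 10, 2), "loan_issued", "L2", 500.0),
    Event(date(2020, 10, 20), "payment_received", "L1", 600.0),
]

print(replay(LOG))
print(issued_by_cohort(LOG))
```

Because the log is the only source of truth here, both views can be recomputed from scratch at any time, which is the property that log-centric architectures such as Kappa rely on.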

Assessment Elements

Programming task 1
Programming task 2
Exam
You can receive full credit for the final exam automatically if you do well on all the assignments.

Interim Assessment

Interim assessment (module 2)
0.2 * Exam + 0.4 * Programming task 1 + 0.4 * Programming task 2
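
The weighting above can be checked with a short calculation. The sketch below simply evaluates the formula; the function name `interim_grade` and the sample scores are hypothetical, and the 10-point scale is an assumption that the syllabus does not state.

```python
def interim_grade(exam, task1, task2):
    """Weighted interim grade: 0.2 * Exam + 0.4 * Task 1 + 0.4 * Task 2."""
    return 0.2 * exam + 0.4 * task1 + 0.4 * task2


# Hypothetical scores on an assumed 10-point scale.
grade = interim_grade(exam=8, task1=9, task2=7)
print(round(grade, 2))
```

The two programming tasks together carry 80% of the grade, so the exam alone can move the result by at most 20% of the scale.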

Bibliography

Recommended Core Bibliography

Harrington, J. L. Relational database design and implementation. – Morgan Kaufmann, 2016. – 441 pp.

Recommended Additional Bibliography

Xu Z. et al. (ed.). Big Data: 6th CCF Conference, Big Data 2018, Xi'an, China, October 11-13, 2018, Proceedings. – Springer, 2018. – Vol. 945.
