Master
2022/2023
Large Scale Machine Learning 1
Type:
Elective course
Area of studies:
Applied Mathematics and Informatics
Delivered by:
Big Data and Information Retrieval School
Where:
Faculty of Computer Science
When:
2nd year, 2nd module
Mode of studies:
distance learning
Online hours:
82
Open to:
students of one campus
Instructors:
Anatoly Bardukov
Master’s programme:
Master of Data Science
Language:
English
ECTS credits:
4
Contact hours:
8
Course Syllabus
Abstract
Large-scale machine learning requires fundamental knowledge of data storage and processing. You need to work with data that is too large for a single machine with standard hardware. Examples of such data include user logs for a particular service, a collection of media files, or Wikipedia articles. This 6-week course covers the main concepts and frameworks actively used in companies for which it is critical to analyze large amounts of data in the shortest possible time. These can be companies that own:
- search engines (for example, Google, Yandex, Microsoft, Yahoo!),
- social networks and blogs (Twitter, LinkedIn, etc.),
- recommendation services (for example, Kinopoisk from Yandex).
The time needed to complete this course depends on your background; you might spend 10 to 20 hours per week.
To complete the course, students are expected to be familiar with classical algorithms and data structures and the main concepts of machine learning, and to be able to program in Python.
Learning Objectives
After taking this course, students should be able to:
- use a distributed file system
- run tasks on a Hadoop cluster
- write code to run on a Hadoop cluster using Hadoop streaming tools
- use a high-level programming language to process large data on a computational cluster
- solve search, index and machine learning problems on a Hadoop cluster
Expected Learning Outcomes
- Be able to use a distributed file system
- Be able to run tasks on a Hadoop cluster
- Be able to write code to run on a Hadoop cluster using Hadoop streaming tools
- Be able to use a high-level programming language to process large data on a computational cluster
- Be able to solve search, index and machine learning problems on a Hadoop cluster
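As an illustration of the kind of code the Hadoop streaming outcome refers to, here is a minimal word-count sketch in Python. It is not taken from the course materials: the function names `mapper`, `reducer`, and `run_locally` are hypothetical, and the example simulates the map, sort, and reduce stages in a single process instead of running on a cluster.

```python
from itertools import groupby
from typing import Iterable, Iterator, List


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one tab-separated 'word<TAB>1' record per word (hypothetical helper)."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records: Iterable[str]) -> Iterator[str]:
    """Sum the counts for each key; the input must already be sorted by key."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines: Iterable[str]) -> List[str]:
    """Simulate map -> shuffle/sort -> reduce in one process (assumption: tiny input)."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["big data big cluster", "data on a cluster"]
    for record in run_locally(sample):
        print(record)
```

On a real cluster, the mapper and reducer would typically be separate scripts that read records from standard input and write to standard output, passed to the Hadoop streaming jar through its `-mapper` and `-reducer` options; the framework performs the sort between the two stages.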
Course Contents
- Big Data introduction
- MapReduce paradigm and Hadoop framework
- SQL over Big Data
- Apache Spark
- Apache Spark 2
- Machine Learning on Spark
Interim Assessment
- 2022/2023 2nd module: 0.15 * Staff Graded Assignment (User routes on the site) + 0.7 * Programming Assignments + 0.15 * Quizzes
Bibliography
Recommended Core Bibliography
- Kienzler, R. (2017). Mastering Apache Spark 2.x - Second Edition (Vol. 2nd ed). Birmingham: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1562681
- White, T. (2015). Hadoop: The Definitive Guide : Storage and Analysis at Internet Scale: Vol. 4th edition. O’Reilly Media.
Recommended Additional Bibliography
- Jules S. Damji, Brooke Wenig, Tathagata Das, & Denny Lee. (2020). Learning Spark. O’Reilly Media.