Распределенные базы и хранилища данных

Бакалавриат 2022/2023

Лучший по критерию «Полезность курса для Вашей будущей карьеры»

Лучший по критерию «Полезность курса для расширения кругозора и разностороннего развития»

Лучший по критерию «Новизна полученных знаний»

Статус: Курс по выбору (Программная инженерия)

Направление: 09.03.04. Программная инженерия

Кто читает: Департамент программной инженерии

Где читается: Факультет компьютерных наук

Когда читается: 3-й курс, 3, 4 модуль

Формат изучения: без онлайн-курса

Охват аудитории: для своего кампуса

Преподаватели: Брейман Александр Давидович

Язык: русский

Кредиты: 5

Контактные часы: 60

Дополнительные материалы в LMS Задать вопрос

Аннотация

Course presents a detailed introduction into distributed data processing, relational data ware-houses, multidimensional OLAP tools and massive parallel data processing systems (Hadoop, Cassandra, MongoDB). Students will develop understanding in the design methodology for distributed databases and data warehouses. Practice studies include implementing databases and applications software in map/reduce paradigm and in several NoSQL data models

Цель освоения дисциплины

The objective of the course is to form professional competencies related to design and imple-mentation of several kinds of distributed databases, including data warehouses, online analyti-cal data processing and big data management tools. Students will get a grasp on strengths and weaknesses of wide spectrum of approaches to data storage, search and retrieval, resulting in informed choice of database model. This course studies different conceptual database models and their properties. The models that will be discussed are: • Relational data warehouse; • Multidimensional data warehouse; • Online analytical processing; • Map/reduce massive parallel data processing; • Key/value, document, graph and wide columnar database models; • Data stream processing. For these conceptual models the course will concentrate on the following points: Why was the database model introduced? Which of the shortcomings of other models does it address? What are the most important concepts and notions for the database model? How is the model imple-mented? Which are the main techniques? The importance of understanding the internals of a particular database model cannot be overemphasized as it is closely connected to its limitations.

Планируемые результаты обучения

Compare databases and data warehouses
Construct Data Cleaning Pipelines
Construct MapReduce programs for execution under YARN
Construct Streaming Processing Programs
Create data models for Cassandra
Create Data Models for Streams
Create document databases, fill it with data, retrieve documents
Create multidimensional models
Describe implementation of in-memory database operations
Describe storage details of in-memory databases
Design Data Integration Solutions
Discuss applicability of in-memory databases
Discuss Cassandra concepts and architecture
Discuss the basic concepts of Data warehousing and OLAP technology
Discuss the multidimensional model and present its main characteristics, components and operations
Employ HDFS for file storage and retrieval
Explain distributed database systems architecture and design
Understand use cases for HDFS and Hadoop
Use key/value data storages
Write Queries and Statements in CQL

Содержание учебной дисциплины

Data Warehousing and Big Data Management
Data Warehousing Architectures and Models
Data Cleaning And Integration
Key/Value and Document Databases
Map/Reduce and Hadoop
Large-scale Distributed Databases
In-memory Databases
Data Streams Management

Элементы контроля

Work on Seminars
Quiz
Midterm Test
Essay
Group Project
Exam

Промежуточная аттестация

2022/2023 4th module
0.15 * Work on Seminars + 0.07 * Midterm Test + 0.07 * Essay + 0.5 * Exam + 0.14 * Group Project + 0.07 * Quiz

Список литературы

Рекомендуемая дополнительная литература

Antony, B., Boudnik, K., Adams, C., Shao, B., Lee, C., & Sasaki, K. (2016). Professional Hadoop. Indianapolis, IN: Wrox. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1233763
Carpenter, J., & Hewitt, E. (2016). Cassandra: The Definitive Guide : Distributed Data at Web Scale (Vol. Second edition). Sebastopol, CA: Reilly - O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1271661
Deka, G. C. (2017). NoSQL : Database for Storage and Retrieval of Data in Cloud. Boca Raton, FL: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1521297
Deshpande, T. (2014). Mastering DynamoDB. Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=836700
Doan, A., Halevy, A., & Ives, Z. G. (2012). Principles of Data Integration. [Waltham, MA]: Morgan Kaufmann. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=465063
Edward, S. G., & Sabharwal, N. (2015). Practical MongoDB : Architecting, Developing, and Administering MongoDB. [Berkeley, CA]: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1124206
Fowler, A. (2015). NoSQL For Dummies. Hoboken, NJ: For Dummies. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=942547
Ganti, V., & Das Sarma, A. (2013). Data Cleaning : A Practical Perspective. [San Rafael, California]: Morgan & Claypool Publishers. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=646833
Haq, Q. M. R. U. (2016). Data Mapping for Data Warehouse Design. Amsterdam: Morgan Kaufmann. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1115852
Hows, D., Membrey, P., Plugge, E., & Hawkins, T. (2015). The Definitive Guide to MongoDB : A Complete Guide to Dealing with Big Data Using MongoDB (Vol. Third edition). [Berkeley, CA]: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1109647
Hueske, F., & Kalavri, V. (2019). Stream Processing with Apache Flink : Fundamentals, Implementation, and Operation of Streaming Applications (Vol. First edition). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=2102045
Inmon, W., & Krishnan, K. (2011). Building the Unstructured Data Warehouse : Architecture, Analysis, and Design (Vol. First edition). Westfield: Technics Publications. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1005034
Kimball, R., & Caserta, J. (2004). The Data Warehouse ETL Toolkit : Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Indianapolis, IN: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=124355
Kimball, R., & Ross, M. (2013). The Data Warehouse Toolkit : The Definitive Guide to Dimensional Modeling (Vol. 3rd edition). Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=605991
Kimball, Ralph, and Margy Ross. The data warehouse toolkit: The definitive guide to dimensional modeling. John Wiley & Sons, 2013.
KOROTKEVITCH, D. (2017). Expert SQL Server In-Memory OLTP (Vol. 2nd ed). Berkeley, CA: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1589522
Krish Krishnan. (2019). Building Big Data Applications. [N.p.]: Academic Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1892146
Linstedt, D., & Olschimke, M. (2015). Building a Scalable Data Warehouse with Data Vault 2.0. Amsterdam: Morgan Kaufmann. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1065504
Nabi, Z. (2016). Pro Spark Streaming : The Zen of Real-Time Analytics Using Apache Spark. [Berkeley, CA]: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1174432
Nelson, J. (2016). Mastering Redis. Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1243702
Nishant Garg. (2014). HBase Essentials. [N.p.]: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=906714
Perkins, L., Redmond, E., & Wilson, J. R. (2018). Seven Databases in Seven Weeks : A Guide to Modern Databases and the NoSQL Movement (Vol. Second edition). Raleigh, N. C: Pragmatic Bookshelf. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1806794
Ray Rankins, Paul Bertucci, Chris Gallelli, & Alex T. Silverstein. (2015). Microsoft SQL Server 2014 Unleashed. [N.p.]: Sams Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1601720
Romeo Kienzler, Md. Rezaul Karim, Sridhar Alla, Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, & Shuen Mei. (2018). Apache Spark 2: Data Processing and Real-Time Analytics : Master Complex Big Data Processing, Stream Analytics, and Machine Learning with Apache Spark. Birmingham: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1991793
Shrivastava, A., & Deshpande, T. (2016). Hadoop Blueprints. Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1364692
Tae, K. H., Roh, Y., Oh, Y. H., Kim, H., & Whang, S. E. (2019). Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsarx&AN=edsarx.1904.10761

Авторы

Брейман Александр Давидович
Петрухина Анастасия Сергеевна

Программа дисциплины