Распределенные базы и хранилища данных

Бакалавриат 2020/2021

Статус: Курс по выбору (Программная инженерия)

Направление: 09.03.04. Программная инженерия

Кто читает: Департамент программной инженерии

Где читается: Факультет компьютерных наук

Когда читается: 3-й курс, 3, 4 модуль

Формат изучения: без онлайн-курса

Преподаватели: Брейман Александр Давидович

Язык: английский

Кредиты: 5

Контактные часы: 60

Full Syllabus Ask Question

Abstract

Course presents a detailed introduction into distributed data processing, relational data ware-houses, multidimensional OLAP tools and massive parallel data processing systems (Hadoop, Cassandra, MongoDB). Students will develop understanding in the design methodology for distributed databases and data warehouses. Practice studies include implementing databases and applications software in map/reduce paradigm and in several NoSQL data models

Learning Objectives

The objective of the course is to form professional competencies related to design and imple-mentation of several kinds of distributed databases, including data warehouses, online analyti-cal data processing and big data management tools. Students will get a grasp on strengths and weaknesses of wide spectrum of approaches to data storage, search and retrieval, resulting in informed choice of database model. This course studies different conceptual database models and their properties. The models that will be discussed are: • Relational data warehouse; • Multidimensional data warehouse; • Online analytical processing; • Map/reduce massive parallel data processing; • Key/value, document, graph and wide columnar database models; • Data stream processing. For these conceptual models the course will concentrate on the following points: Why was the database model introduced? Which of the shortcomings of other models does it address? What are the most important concepts and notions for the database model? How is the model imple-mented? Which are the main techniques? The importance of understanding the internals of a particular database model cannot be overemphasized as it is closely connected to its limitations.

Expected Learning Outcomes

Explain distributed database systems architecture and design
Compare databases and data warehouses
Discuss the basic concepts of Data warehousing and OLAP technology
Discuss the multidimensional model and present its main characteristics, components and operations
Create multidimensional models
Construct Data Cleaning Pipelines
Design Data Integration Solutions
Use key/value data storages
Create document databases, fill it with data, retrieve documents
Understand use cases for HDFS and Hadoop
Construct MapReduce programs for execution under YARN
Employ HDFS for file storage and retrieval
Discuss Cassandra concepts and architecture
Create data models for Cassandra
Write Queries and Statements in CQL
Discuss applicability of in-memory databases
Describe storage details of in-memory databases
Describe implementation of in-memory database operations
Create Data Models for Streams
Construct Streaming Processing Programs

Course Contents

Data Warehousing and Big Data Management
History of Data Management Approaches. A Historical Overview of Data Warehousing. Conceptual Database Design, Logical Database Design, The Relational Model, Normalization, Relational Query Languages, Physical Database Design.
Data Warehousing Architectures and Models
Data Warehouse Concepts, Multidimensional Model, Hierarchies, Measures, OLAP Operations Data Warehouse Architecture, Back-End Tier, Data Warehouse Tier, OLAP Tier, Front-End Tier, Staging, ODS Conceptual Data Warehouse Design, Hierarchies, Balanced Hierarchies, Unbalanced Hierarchies, Generalized Hierarchies, Alternative Hierarchies, Parallel Hierarchies, Nonstrict Hierarchies,Facts with Multiple Granularities, Many-to-Many Dimensions Logical Data Warehouse Design, Relational Data Warehouse Design, Time Dimension, Logical Representation of Hierarchies, Slowly Changing Dimensions SQL/OLAP Operations, ROLLUP, CUBE, and GROUPING SETS, Window Functions Querying Data Warehouses, MDX, Comparison of MDX and SQL Physical Data Warehouse Design, Materialized Views, PipeSort Algorithm, Cube Size Estimation, Indexes for Data Warehouses, Bitmap Indexes, Join Indexes, Data Warehouse Partitioning, Partitioning Strategies Extraction, Transformation, and Loading
Data Cleaning And Integration
Data Qualty Criteria: Validity, accuracy,, completeness, consistency, uniformity. Data Audit, Parsing, Data Transformation, Duplicate Elimination. Mediator/Wrapper Architecture. Data Integration:LAV, GAV, GLAV.
Key/Value and Document Databases
Minimal and Structured Key/Value Databases, Operations. Document database MongoDB concepts and architecture, CRUD operations, Aggregation framwork, Map/reduce, Data Modeling, Indexing, Replication, Sharding.
Map/Reduce and Hadoop
Core Hadoop concepts, HDFS architecture, replication, reads and writes, HDFS commands, MapReduce program structure, Data formats for MapReduce, YARN architecture
Large-scale Distributed Databases
Core Cassandra concepts, architecture, replication, eventual consistency, Write path, Hinted handoff, Read path, Read repair, Node repair, Compaction and tombstones, Primary and partition keys, CQL, Transactions, Query-driven data modeling, Table and key design, Secondary Indexes
In-memory Databases
Data Storage In Main Memory, Column Orientation, Dictionary Encoding, Compression, Data Layout, Partition, Database Operators, Materialization Strategies.
Data Streams Management
Push-based (streaming) computation model, Data Stream Management Systems Implementation Choices, Stream Data Models, Stream Query Languages, Operators over Unbounded Streams and Windows.

Assessment Elements

Work on Seminars
Quiz
Midterm Test
Essay
Exam
Экзамен в MS Teams. В качестве резервного канала будет одновременно запущена встреча в Zoom. Письменный экзамен. Асинхронный прокторинг
Group Project

Interim Assessment

Interim assessment (4 module)
0.07 * Essay + 0.5 * Exam + 0.14 * Group Project + 0.07 * Midterm Test + 0.07 * Quiz + 0.15 * Work on Seminars

Bibliography

Recommended Core Bibliography

Berg, Silvia, P., & Frye, R. (2016). SAP HANA : An Introduction (Vol. Fourth edition). Bonn: SAP PRESS. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1350145
Golab, L., & Özsu, M. T. (2010). Data Stream Management. [San Rafael, Calif.]: Morgan & Claypool Publishers. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=440359
Jukic, N., Vrbsky, S., & Nestorov, S. (2017). Database Systems : Introduction to Databases and Data Warehouses. Burlington, Virginia: Prospect Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1562389
Parsian, M. (2015). Data Algorithms : Recipes for Scaling Up with Hadoop and Spark. [Sebastopol, CA]: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1028927

Recommended Additional Bibliography

Antony, B., Boudnik, K., Adams, C., Shao, B., Lee, C., & Sasaki, K. (2016). Professional Hadoop. Indianapolis, IN: Wrox. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1233763
Carpenter, J., & Hewitt, E. (2016). Cassandra: The Definitive Guide : Distributed Data at Web Scale (Vol. Second edition). Sebastopol, CA: Reilly - O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1271661
Deka, G. C. (2017). NoSQL : Database for Storage and Retrieval of Data in Cloud. Boca Raton, FL: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1521297
Deshpande, T. (2014). Mastering DynamoDB. Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=836700
Doan, A., Halevy, A., & Ives, Z. G. (2012). Principles of Data Integration. [Waltham, MA]: Morgan Kaufmann. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=465063
Edward, S. G., & Sabharwal, N. (2015). Practical MongoDB : Architecting, Developing, and Administering MongoDB. [Berkeley, CA]: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1124206
Fowler, A. (2015). NoSQL For Dummies. Hoboken, NJ: For Dummies. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=942547
Ganti, V., & Das Sarma, A. (2013). Data Cleaning : A Practical Perspective. [San Rafael, California]: Morgan & Claypool Publishers. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=646833
Haq, Q. M. R. U. (2016). Data Mapping for Data Warehouse Design. Amsterdam: Morgan Kaufmann. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1115852
Hows, D., Membrey, P., Plugge, E., & Hawkins, T. (2015). The Definitive Guide to MongoDB : A Complete Guide to Dealing with Big Data Using MongoDB (Vol. Third edition). [Berkeley, CA]: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1109647
Hueske, F., & Kalavri, V. (2019). Stream Processing with Apache Flink : Fundamentals, Implementation, and Operation of Streaming Applications (Vol. First edition). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=2102045
Inmon, W., & Krishnan, K. (2011). Building the Unstructured Data Warehouse : Architecture, Analysis, and Design (Vol. First edition). Westfield: Technics Publications. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1005034
Kimball, R., & Caserta, J. (2004). The Data Warehouse ETL Toolkit : Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Indianapolis, IN: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=124355
Kimball, R., & Ross, M. (2013). The Data Warehouse Toolkit : The Definitive Guide to Dimensional Modeling (Vol. 3rd edition). Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=605991
Kimball, Ralph, and Margy Ross. The data warehouse toolkit: The definitive guide to dimensional modeling. John Wiley & Sons, 2013.
KOROTKEVITCH, D. (2017). Expert SQL Server In-Memory OLTP (Vol. 2nd ed). Berkeley, CA: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1589522
Krish Krishnan. (2019). Building Big Data Applications. [N.p.]: Academic Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1892146
Linstedt, D., & Olschimke, M. (2015). Building a Scalable Data Warehouse with Data Vault 2.0. Amsterdam: Morgan Kaufmann. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1065504
Nabi, Z. (2016). Pro Spark Streaming : The Zen of Real-Time Analytics Using Apache Spark. [Berkeley, CA]: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1174432
Nelson, J. (2016). Mastering Redis. Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1243702
Nishant Garg. (2014). HBase Essentials. [N.p.]: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=906714
Perkins, L., Redmond, E., & Wilson, J. R. (2018). Seven Databases in Seven Weeks : A Guide to Modern Databases and the NoSQL Movement (Vol. Second edition). Raleigh, N. C: Pragmatic Bookshelf. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1806794
Ray Rankins, Paul Bertucci, Chris Gallelli, & Alex T. Silverstein. (2015). Microsoft SQL Server 2014 Unleashed. [N.p.]: Sams Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1601720
Romeo Kienzler, Md. Rezaul Karim, Sridhar Alla, Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, & Shuen Mei. (2018). Apache Spark 2: Data Processing and Real-Time Analytics : Master Complex Big Data Processing, Stream Analytics, and Machine Learning with Apache Spark. Birmingham: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1991793
Shrivastava, A., & Deshpande, T. (2016). Hadoop Blueprints. Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1364692
Tae, K. H., Roh, Y., Oh, Y. H., Kim, H., & Whang, S. E. (2019). Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsarx&AN=edsarx.1904.10761

Course Syllabus