Бакалавриат
2020/2021
Распределенные базы и хранилища данных
Статус:
Курс по выбору (Программная инженерия)
Направление:
09.03.04. Программная инженерия
Кто читает:
Департамент программной инженерии
Где читается:
Факультет компьютерных наук
Когда читается:
3-й курс, 3, 4 модуль
Формат изучения:
без онлайн-курса
Преподаватели:
Брейман Александр Давидович
Язык:
английский
Кредиты:
5
Контактные часы:
60
Course Syllabus
Abstract
Course presents a detailed introduction into distributed data processing, relational data ware-houses, multidimensional OLAP tools and massive parallel data processing systems (Hadoop, Cassandra, MongoDB). Students will develop understanding in the design methodology for distributed databases and data warehouses. Practice studies include implementing databases and applications software in map/reduce paradigm and in several NoSQL data models
Learning Objectives
- The objective of the course is to form professional competencies related to design and imple-mentation of several kinds of distributed databases, including data warehouses, online analyti-cal data processing and big data management tools. Students will get a grasp on strengths and weaknesses of wide spectrum of approaches to data storage, search and retrieval, resulting in informed choice of database model. This course studies different conceptual database models and their properties. The models that will be discussed are: • Relational data warehouse; • Multidimensional data warehouse; • Online analytical processing; • Map/reduce massive parallel data processing; • Key/value, document, graph and wide columnar database models; • Data stream processing. For these conceptual models the course will concentrate on the following points: Why was the database model introduced? Which of the shortcomings of other models does it address? What are the most important concepts and notions for the database model? How is the model imple-mented? Which are the main techniques? The importance of understanding the internals of a particular database model cannot be overemphasized as it is closely connected to its limitations.
Expected Learning Outcomes
- Explain distributed database systems architecture and design
- Compare databases and data warehouses
- Discuss the basic concepts of Data warehousing and OLAP technology
- Discuss the multidimensional model and present its main characteristics, components and operations
- Create multidimensional models
- Construct Data Cleaning Pipelines
- Design Data Integration Solutions
- Use key/value data storages
- Create document databases, fill it with data, retrieve documents
- Understand use cases for HDFS and Hadoop
- Construct MapReduce programs for execution under YARN
- Employ HDFS for file storage and retrieval
- Discuss Cassandra concepts and architecture
- Create data models for Cassandra
- Write Queries and Statements in CQL
- Discuss applicability of in-memory databases
- Describe storage details of in-memory databases
- Describe implementation of in-memory database operations
- Create Data Models for Streams
- Construct Streaming Processing Programs
Course Contents
- Data Warehousing and Big Data ManagementHistory of Data Management Approaches. A Historical Overview of Data Warehousing. Conceptual Database Design, Logical Database Design, The Relational Model, Normalization, Relational Query Languages, Physical Database Design.
- Data Warehousing Architectures and ModelsData Warehouse Concepts, Multidimensional Model, Hierarchies, Measures, OLAP Operations Data Warehouse Architecture, Back-End Tier, Data Warehouse Tier, OLAP Tier, Front-End Tier, Staging, ODS Conceptual Data Warehouse Design, Hierarchies, Balanced Hierarchies, Unbalanced Hierarchies, Generalized Hierarchies, Alternative Hierarchies, Parallel Hierarchies, Nonstrict Hierarchies,Facts with Multiple Granularities, Many-to-Many Dimensions Logical Data Warehouse Design, Relational Data Warehouse Design, Time Dimension, Logical Representation of Hierarchies, Slowly Changing Dimensions SQL/OLAP Operations, ROLLUP, CUBE, and GROUPING SETS, Window Functions Querying Data Warehouses, MDX, Comparison of MDX and SQL Physical Data Warehouse Design, Materialized Views, PipeSort Algorithm, Cube Size Estimation, Indexes for Data Warehouses, Bitmap Indexes, Join Indexes, Data Warehouse Partitioning, Partitioning Strategies Extraction, Transformation, and Loading
- Data Cleaning And IntegrationData Qualty Criteria: Validity, accuracy,, completeness, consistency, uniformity. Data Audit, Parsing, Data Transformation, Duplicate Elimination. Mediator/Wrapper Architecture. Data Integration:LAV, GAV, GLAV.
- Key/Value and Document DatabasesMinimal and Structured Key/Value Databases, Operations. Document database MongoDB concepts and architecture, CRUD operations, Aggregation framwork, Map/reduce, Data Modeling, Indexing, Replication, Sharding.
- Map/Reduce and HadoopCore Hadoop concepts, HDFS architecture, replication, reads and writes, HDFS commands, MapReduce program structure, Data formats for MapReduce, YARN architecture
- Large-scale Distributed DatabasesCore Cassandra concepts, architecture, replication, eventual consistency, Write path, Hinted handoff, Read path, Read repair, Node repair, Compaction and tombstones, Primary and partition keys, CQL, Transactions, Query-driven data modeling, Table and key design, Secondary Indexes
- In-memory DatabasesData Storage In Main Memory, Column Orientation, Dictionary Encoding, Compression, Data Layout, Partition, Database Operators, Materialization Strategies.
- Data Streams ManagementPush-based (streaming) computation model, Data Stream Management Systems Implementation Choices, Stream Data Models, Stream Query Languages, Operators over Unbounded Streams and Windows.
Assessment Elements
- Work on Seminars
- Quiz
- Midterm Test
- Essay
- ExamЭкзамен в MS Teams. В качестве резервного канала будет одновременно запущена встреча в Zoom. Письменный экзамен. Асинхронный прокторинг
- Group Project
Interim Assessment
- Interim assessment (4 module)0.07 * Essay + 0.5 * Exam + 0.14 * Group Project + 0.07 * Midterm Test + 0.07 * Quiz + 0.15 * Work on Seminars
Bibliography
Recommended Core Bibliography
- Berg, Silvia, P., & Frye, R. (2016). SAP HANA : An Introduction (Vol. Fourth edition). Bonn: SAP PRESS. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1350145
- Golab, L., & Özsu, M. T. (2010). Data Stream Management. [San Rafael, Calif.]: Morgan & Claypool Publishers. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=440359
- Jukic, N., Vrbsky, S., & Nestorov, S. (2017). Database Systems : Introduction to Databases and Data Warehouses. Burlington, Virginia: Prospect Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1562389
- Parsian, M. (2015). Data Algorithms : Recipes for Scaling Up with Hadoop and Spark. [Sebastopol, CA]: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1028927
Recommended Additional Bibliography
- Antony, B., Boudnik, K., Adams, C., Shao, B., Lee, C., & Sasaki, K. (2016). Professional Hadoop. Indianapolis, IN: Wrox. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1233763
- Carpenter, J., & Hewitt, E. (2016). Cassandra: The Definitive Guide : Distributed Data at Web Scale (Vol. Second edition). Sebastopol, CA: Reilly - O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1271661
- Deka, G. C. (2017). NoSQL : Database for Storage and Retrieval of Data in Cloud. Boca Raton, FL: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1521297
- Deshpande, T. (2014). Mastering DynamoDB. Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=836700
- Doan, A., Halevy, A., & Ives, Z. G. (2012). Principles of Data Integration. [Waltham, MA]: Morgan Kaufmann. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=465063
- Edward, S. G., & Sabharwal, N. (2015). Practical MongoDB : Architecting, Developing, and Administering MongoDB. [Berkeley, CA]: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1124206
- Fowler, A. (2015). NoSQL For Dummies. Hoboken, NJ: For Dummies. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=942547
- Ganti, V., & Das Sarma, A. (2013). Data Cleaning : A Practical Perspective. [San Rafael, California]: Morgan & Claypool Publishers. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=646833
- Haq, Q. M. R. U. (2016). Data Mapping for Data Warehouse Design. Amsterdam: Morgan Kaufmann. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1115852
- Hows, D., Membrey, P., Plugge, E., & Hawkins, T. (2015). The Definitive Guide to MongoDB : A Complete Guide to Dealing with Big Data Using MongoDB (Vol. Third edition). [Berkeley, CA]: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1109647
- Hueske, F., & Kalavri, V. (2019). Stream Processing with Apache Flink : Fundamentals, Implementation, and Operation of Streaming Applications (Vol. First edition). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=2102045
- Inmon, W., & Krishnan, K. (2011). Building the Unstructured Data Warehouse : Architecture, Analysis, and Design (Vol. First edition). Westfield: Technics Publications. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1005034
- Kimball, R., & Caserta, J. (2004). The Data Warehouse ETL Toolkit : Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Indianapolis, IN: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=124355
- Kimball, R., & Ross, M. (2013). The Data Warehouse Toolkit : The Definitive Guide to Dimensional Modeling (Vol. 3rd edition). Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=605991
- Kimball, Ralph, and Margy Ross. The data warehouse toolkit: The definitive guide to dimensional modeling. John Wiley & Sons, 2013.
- KOROTKEVITCH, D. (2017). Expert SQL Server In-Memory OLTP (Vol. 2nd ed). Berkeley, CA: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1589522
- Krish Krishnan. (2019). Building Big Data Applications. [N.p.]: Academic Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1892146
- Linstedt, D., & Olschimke, M. (2015). Building a Scalable Data Warehouse with Data Vault 2.0. Amsterdam: Morgan Kaufmann. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1065504
- Nabi, Z. (2016). Pro Spark Streaming : The Zen of Real-Time Analytics Using Apache Spark. [Berkeley, CA]: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1174432
- Nelson, J. (2016). Mastering Redis. Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1243702
- Nishant Garg. (2014). HBase Essentials. [N.p.]: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=906714
- Perkins, L., Redmond, E., & Wilson, J. R. (2018). Seven Databases in Seven Weeks : A Guide to Modern Databases and the NoSQL Movement (Vol. Second edition). Raleigh, N. C: Pragmatic Bookshelf. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1806794
- Ray Rankins, Paul Bertucci, Chris Gallelli, & Alex T. Silverstein. (2015). Microsoft SQL Server 2014 Unleashed. [N.p.]: Sams Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1601720
- Romeo Kienzler, Md. Rezaul Karim, Sridhar Alla, Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, & Shuen Mei. (2018). Apache Spark 2: Data Processing and Real-Time Analytics : Master Complex Big Data Processing, Stream Analytics, and Machine Learning with Apache Spark. Birmingham: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1991793
- Shrivastava, A., & Deshpande, T. (2016). Hadoop Blueprints. Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1364692
- Tae, K. H., Roh, Y., Oh, Y. H., Kim, H., & Whang, S. E. (2019). Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsarx&AN=edsarx.1904.10761