We use cookies in order to improve the quality and usability of the HSE website. More information about the use of cookies is available here, and the regulations on processing personal data can be found here. By continuing to use the site, you hereby confirm that you have been informed of the use of cookies by the HSE website and agree with our rules for processing personal data. You may disable cookies in your browser settings.

  • A
  • A
  • A
  • ABC
  • ABC
  • ABC
  • А
  • А
  • А
  • А
  • А
Regular version of the site
Bachelor 2020/2021

Distributed Databases and Data Warehouses

Type: Elective course (Software Engineering)
Area of studies: Software Engineering
When: 3 year, 3, 4 module
Mode of studies: offline
Instructors: Alexander Breyman
Language: English
ECTS credits: 5
Contact hours: 60

Course Syllabus

Abstract

Course presents a detailed introduction into distributed data processing, relational data ware-houses, multidimensional OLAP tools and massive parallel data processing systems (Hadoop, Cassandra, MongoDB). Students will develop understanding in the design methodology for distributed databases and data warehouses. Practice studies include implementing databases and applications software in map/reduce paradigm and in several NoSQL data models
Learning Objectives

Learning Objectives

  • The objective of the course is to form professional competencies related to design and imple-mentation of several kinds of distributed databases, including data warehouses, online analyti-cal data processing and big data management tools. Students will get a grasp on strengths and weaknesses of wide spectrum of approaches to data storage, search and retrieval, resulting in informed choice of database model. This course studies different conceptual database models and their properties. The models that will be discussed are: • Relational data warehouse; • Multidimensional data warehouse; • Online analytical processing; • Map/reduce massive parallel data processing; • Key/value, document, graph and wide columnar database models; • Data stream processing. For these conceptual models the course will concentrate on the following points: Why was the database model introduced? Which of the shortcomings of other models does it address? What are the most important concepts and notions for the database model? How is the model imple-mented? Which are the main techniques? The importance of understanding the internals of a particular database model cannot be overemphasized as it is closely connected to its limitations.
Expected Learning Outcomes

Expected Learning Outcomes

  • Explain distributed database systems architecture and design
  • Compare databases and data warehouses
  • Discuss the basic concepts of Data warehousing and OLAP technology
  • Discuss the multidimensional model and present its main characteristics, components and operations
  • Create multidimensional models
  • Construct Data Cleaning Pipelines
  • Design Data Integration Solutions
  • Use key/value data storages
  • Create document databases, fill it with data, retrieve documents
  • Understand use cases for HDFS and Hadoop
  • Construct MapReduce programs for execution under YARN
  • Employ HDFS for file storage and retrieval
  • Discuss Cassandra concepts and architecture
  • Create data models for Cassandra
  • Write Queries and Statements in CQL
  • Discuss applicability of in-memory databases
  • Describe storage details of in-memory databases
  • Describe implementation of in-memory database operations
  • Create Data Models for Streams
  • Construct Streaming Processing Programs
Course Contents

Course Contents

  • Data Warehousing and Big Data Management
    History of Data Management Approaches. A Historical Overview of Data Warehousing. Conceptual Database Design, Logical Database Design, The Relational Model, Normalization, Relational Query Languages, Physical Database Design.
  • Data Warehousing Architectures and Models
    Data Warehouse Concepts, Multidimensional Model, Hierarchies, Measures, OLAP Operations Data Warehouse Architecture, Back-End Tier, Data Warehouse Tier, OLAP Tier, Front-End Tier, Staging, ODS Conceptual Data Warehouse Design, Hierarchies, Balanced Hierarchies, Unbalanced Hierarchies, Generalized Hierarchies, Alternative Hierarchies, Parallel Hierarchies, Nonstrict Hierarchies,Facts with Multiple Granularities, Many-to-Many Dimensions Logical Data Warehouse Design, Relational Data Warehouse Design, Time Dimension, Logical Representation of Hierarchies, Slowly Changing Dimensions SQL/OLAP Operations, ROLLUP, CUBE, and GROUPING SETS, Window Functions Querying Data Warehouses, MDX, Comparison of MDX and SQL Physical Data Warehouse Design, Materialized Views, PipeSort Algorithm, Cube Size Estimation, Indexes for Data Warehouses, Bitmap Indexes, Join Indexes, Data Warehouse Partitioning, Partitioning Strategies Extraction, Transformation, and Loading
  • Data Cleaning And Integration
    Data Qualty Criteria: Validity, accuracy,, completeness, consistency, uniformity. Data Audit, Parsing, Data Transformation, Duplicate Elimination. Mediator/Wrapper Architecture. Data Integration:LAV, GAV, GLAV.
  • Key/Value and Document Databases
    Minimal and Structured Key/Value Databases, Operations. Document database MongoDB concepts and architecture, CRUD operations, Aggregation framwork, Map/reduce, Data Modeling, Indexing, Replication, Sharding.
  • Map/Reduce and Hadoop
    Core Hadoop concepts, HDFS architecture, replication, reads and writes, HDFS commands, MapReduce program structure, Data formats for MapReduce, YARN architecture
  • Large-scale Distributed Databases
    Core Cassandra concepts, architecture, replication, eventual consistency, Write path, Hinted handoff, Read path, Read repair, Node repair, Compaction and tombstones, Primary and partition keys, CQL, Transactions, Query-driven data modeling, Table and key design, Secondary Indexes
  • In-memory Databases
    Data Storage In Main Memory, Column Orientation, Dictionary Encoding, Compression, Data Layout, Partition, Database Operators, Materialization Strategies.
  • Data Streams Management
    Push-based (streaming) computation model, Data Stream Management Systems Implementation Choices, Stream Data Models, Stream Query Languages, Operators over Unbounded Streams and Windows.
Assessment Elements

Assessment Elements

  • non-blocking Work on Seminars
  • non-blocking Quiz
  • non-blocking Midterm Test
  • non-blocking Essay
  • non-blocking Exam
    Экзамен в MS Teams. В качестве резервного канала будет одновременно запущена встреча в Zoom. Письменный экзамен. Асинхронный прокторинг
  • non-blocking Group Project
Interim Assessment

Interim Assessment

  • Interim assessment (4 module)
    0.07 * Essay + 0.5 * Exam + 0.14 * Group Project + 0.07 * Midterm Test + 0.07 * Quiz + 0.15 * Work on Seminars
Bibliography

Bibliography

Recommended Core Bibliography

  • Berg, Silvia, P., & Frye, R. (2016). SAP HANA : An Introduction (Vol. Fourth edition). Bonn: SAP PRESS. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1350145
  • Golab, L., & Özsu, M. T. (2010). Data Stream Management. [San Rafael, Calif.]: Morgan & Claypool Publishers. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=440359
  • Jukic, N., Vrbsky, S., & Nestorov, S. (2017). Database Systems : Introduction to Databases and Data Warehouses. Burlington, Virginia: Prospect Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1562389
  • Parsian, M. (2015). Data Algorithms : Recipes for Scaling Up with Hadoop and Spark. [Sebastopol, CA]: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1028927

Recommended Additional Bibliography

  • Antony, B., Boudnik, K., Adams, C., Shao, B., Lee, C., & Sasaki, K. (2016). Professional Hadoop. Indianapolis, IN: Wrox. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1233763
  • Carpenter, J., & Hewitt, E. (2016). Cassandra: The Definitive Guide : Distributed Data at Web Scale (Vol. Second edition). Sebastopol, CA: Reilly - O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1271661
  • Deka, G. C. (2017). NoSQL : Database for Storage and Retrieval of Data in Cloud. Boca Raton, FL: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1521297
  • Deshpande, T. (2014). Mastering DynamoDB. Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=836700
  • Doan, A., Halevy, A., & Ives, Z. G. (2012). Principles of Data Integration. [Waltham, MA]: Morgan Kaufmann. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=465063
  • Edward, S. G., & Sabharwal, N. (2015). Practical MongoDB : Architecting, Developing, and Administering MongoDB. [Berkeley, CA]: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1124206
  • Fowler, A. (2015). NoSQL For Dummies. Hoboken, NJ: For Dummies. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=942547
  • Ganti, V., & Das Sarma, A. (2013). Data Cleaning : A Practical Perspective. [San Rafael, California]: Morgan & Claypool Publishers. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=646833
  • Haq, Q. M. R. U. (2016). Data Mapping for Data Warehouse Design. Amsterdam: Morgan Kaufmann. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1115852
  • Hows, D., Membrey, P., Plugge, E., & Hawkins, T. (2015). The Definitive Guide to MongoDB : A Complete Guide to Dealing with Big Data Using MongoDB (Vol. Third edition). [Berkeley, CA]: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1109647
  • Hueske, F., & Kalavri, V. (2019). Stream Processing with Apache Flink : Fundamentals, Implementation, and Operation of Streaming Applications (Vol. First edition). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=2102045
  • Inmon, W., & Krishnan, K. (2011). Building the Unstructured Data Warehouse : Architecture, Analysis, and Design (Vol. First edition). Westfield: Technics Publications. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1005034
  • Kimball, R., & Caserta, J. (2004). The Data Warehouse ETL Toolkit : Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Indianapolis, IN: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=124355
  • Kimball, R., & Ross, M. (2013). The Data Warehouse Toolkit : The Definitive Guide to Dimensional Modeling (Vol. 3rd edition). Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=605991
  • Kimball, Ralph, and Margy Ross. The data warehouse toolkit: The definitive guide to dimensional modeling. John Wiley & Sons, 2013.
  • KOROTKEVITCH, D. (2017). Expert SQL Server In-Memory OLTP (Vol. 2nd ed). Berkeley, CA: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1589522
  • Krish Krishnan. (2019). Building Big Data Applications. [N.p.]: Academic Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1892146
  • Linstedt, D., & Olschimke, M. (2015). Building a Scalable Data Warehouse with Data Vault 2.0. Amsterdam: Morgan Kaufmann. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1065504
  • Nabi, Z. (2016). Pro Spark Streaming : The Zen of Real-Time Analytics Using Apache Spark. [Berkeley, CA]: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1174432
  • Nelson, J. (2016). Mastering Redis. Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1243702
  • Nishant Garg. (2014). HBase Essentials. [N.p.]: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=906714
  • Perkins, L., Redmond, E., & Wilson, J. R. (2018). Seven Databases in Seven Weeks : A Guide to Modern Databases and the NoSQL Movement (Vol. Second edition). Raleigh, N. C: Pragmatic Bookshelf. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1806794
  • Ray Rankins, Paul Bertucci, Chris Gallelli, & Alex T. Silverstein. (2015). Microsoft SQL Server 2014 Unleashed. [N.p.]: Sams Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1601720
  • Romeo Kienzler, Md. Rezaul Karim, Sridhar Alla, Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, & Shuen Mei. (2018). Apache Spark 2: Data Processing and Real-Time Analytics : Master Complex Big Data Processing, Stream Analytics, and Machine Learning with Apache Spark. Birmingham: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1991793
  • Shrivastava, A., & Deshpande, T. (2016). Hadoop Blueprints. Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1364692
  • Tae, K. H., Roh, Y., Oh, Y. H., Kim, H., & Whang, S. E. (2019). Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsarx&AN=edsarx.1904.10761