Master
2020/2021
Introduction to collection and analysis of 'Big data'
Category 'Best Course for Career Development'
Category 'Best Course for Broadening Horizons and Diversity of Knowledge and Skills'
Category 'Best Course for New Knowledge and Skills'
Type:
Compulsory course (Complex Social Analysis)
Area of studies:
Sociology
Delivered by:
School of Sociology
Where:
Faculty of Social Sciences
When:
Year 1, modules 1 and 2
Mode of studies:
distance learning
Open to:
students of all HSE University campuses
Instructors:
Alexander Byzov
Master’s programme:
Complex Social Analysis
Language:
English
ECTS credits:
4
Contact hours:
44
Course Syllabus
Abstract
This is an introductory course on gathering and analyzing Internet data. It focuses on two broad topics: data scraping and the analysis of textual data. The course is taught through training sessions and practical work. All teaching is conducted in English. Several R packages will be used for data analysis (R is freely available at https://www.r-project.org).
This discipline builds on the following subjects:
- Probability Theory and Mathematical Statistics;
- Methodology and Methods for Sociological Research.
This discipline requires the following knowledge and skills:
- knowledge of the basic components of sociological research;
- knowledge of various sampling techniques, their opportunities and limitations.
The main ideas of the discipline may be applied in the following course:
- Theory and Practice of Online Research.
The following online courses may be helpful for this discipline:
- Shah C. Social Media Data Analytics. URL: https://www.coursera.org/learn/social-media-data-analytics (retrieved: 20.06.2018)
- Leek J., Peng R. D., Caffo B. Getting and Cleaning Data. URL: https://www.coursera.org/learn/data-cleaning (retrieved: 20.06.2018)
- Potapenko A., Zobnin A., Kozlova A., Yudin S., Zimovnov A. Natural Language Processing. URL: https://www.coursera.org/learn/language-processing (retrieved: 20.06.2018)
Learning Objectives
- Study of basic notions of Big data research
- Use of basic techniques to gather Big data and analyze it
Expected Learning Outcomes
- Know basic concepts of Big data, its opportunities, limitations, and relevance to social sciences
- Know basic concepts of the R programming language
- Have skills to write R code for basic data analysis tasks
- Have skills to scrape online data through various APIs, browser automation, etc.
- Have skills to analyze textual data
Course Contents
- Analysis of textual data in R. Basic concepts of text mining. Types of text mining. Packages (qdap, stringi, stringr, tm, quanteda, NLP etc.). Text preprocessing. Term frequency analysis. Keyword analysis. Sentiment analysis. Topic analysis. Document clustering and classification. Introduction to advanced models (text2vec etc.). Visualization.
- Introduction to R. What is R. Comparisons between R and SPSS, R and Stata, R and Python. Packages. Files. Variables. Data storage in R (vectors, lists, data frames etc.). Regular expressions. Conditions. Loops. Functions. Tidyverse in R. Limitations of R.
- Introduction to Big data. What is Big data. Different understandings of the notion, its opportunities and limitations. Big data applications in various types of social studies. Cases. Biases. Ethical concerns.
- Data scraping in R. Basic information on web data (HTML, XML, HTTP, AJAX etc.). Data retrieval via APIs. R packages for social media APIs (Twitter, Facebook, Vkontakte etc.). Limitations of APIs. Various scenarios for data retrieval without APIs. R packages for data retrieval without APIs (rvest, httr etc.). Browser automation for scraping dynamic pages (with the RSelenium package). Cleaning data.
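To give a flavour of the text preprocessing and term frequency analysis topics listed above, here is a minimal sketch in base R. It is not taken from the course materials: the toy documents and the two-word stop-word list are hypothetical, and the course itself names dedicated packages (tm, quanteda etc.) for this kind of work.

```r
# Hypothetical toy corpus (illustrative only, not course data).
docs <- c(
  "Big data is big.",
  "Text mining turns text into data."
)

# Preprocessing: lowercase, strip punctuation, split on whitespace.
clean  <- gsub("[[:punct:]]", "", tolower(docs))
tokens <- unlist(strsplit(clean, "\\s+"))

# Drop a tiny, hand-picked stop-word list (an assumption for this example).
stopwords <- c("is", "into")
tokens <- tokens[!tokens %in% stopwords]

# Term frequencies, most frequent first.
freq <- sort(table(tokens), decreasing = TRUE)
print(freq)
```

Base R is used here only to keep the sketch dependency-free; a real analysis would more likely build a document-term matrix with one of the packages named in the course contents.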
Assessment Elements
- Class Attendance
- Class Participation
- Home assignment 1. Each student must complete this home assignment individually. Students must submit a PDF file with answers and an R script. The assignment is graded from 1 (fail) to 10 (excellent).
- Home assignment 2. Each student must complete this home assignment individually. Students must submit a PDF file with answers and an R script. The assignment is graded from 1 (fail) to 10 (excellent).
- Essay. In the essay, a group of students (up to 4) should scrape and analyze online data from various sources on a chosen topic (for instance, news coverage of an event) and report the results in a coherent text with an introduction (research question, short literature review, and main hypotheses), a main body (analysis), a conclusion, a list of references, and the R script in an appendix. The essay should be at least 8000 characters long, excluding the appendix.
Interim Assessment
- Interim assessment (module 2): 0.12 * Class Attendance + 0.13 * Class Participation + 0.45 * Essay + 0.15 * Home assignment 1 + 0.15 * Home assignment 2
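As a worked illustration of the formula above, the following R sketch computes the weighted sum for one student. The component grades are hypothetical, and the sketch assumes every element is scored on the same 10-point scale (the syllabus states this scale explicitly only for the home assignments). No rounding rule is applied, since none is given here.

```r
# Weights taken from the interim assessment formula (they sum to 1).
weights <- c(
  attendance    = 0.12,
  participation = 0.13,
  essay         = 0.45,
  home1         = 0.15,
  home2         = 0.15
)

# Hypothetical component grades on a 10-point scale (illustrative values only).
grades <- c(
  attendance    = 10,
  participation = 8,
  essay         = 7,
  home1         = 9,
  home2         = 6
)

# Weighted sum; indexing by names(weights) keeps each grade aligned with its weight.
interim <- sum(weights * grades[names(weights)])
print(interim)
```

With these example values the essay contributes 0.45 * 7 = 3.15 points, which is more than both home assignments combined (0.15 * 9 + 0.15 * 6 = 2.25).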
Bibliography
Recommended Core Bibliography
- Mayer-Schönberger, V., & Cukier, K. (2013). Big Data : A Revolution That Will Transform How We Live, Work, and Think. Boston: Eamon Dolan/Houghton Mifflin Harcourt. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1872664
- Kabacoff, R. I. (2014). R in Action: Data Analysis and Visualization in R (Russian edition). DMK Press. 588 pp. ISBN: 978-5-97060-077-1. Electronic text. EBS Lan. URL: https://e.lanbook.com/book/58703
Recommended Additional Bibliography
- Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. New York, NY: Springer. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1175341