Master
2020/2021
Linguistic Data: Quantitative Analysis and Visualisation
Category 'Best Course for Broadening Horizons and Diversity of Knowledge and Skills'
Category 'Best Course for New Knowledge and Skills'
Type:
Compulsory course (Computational Linguistics)
Area of studies:
Fundamental and Applied Linguistics
Delivered by:
School of Linguistics
Where:
Faculty of Humanities
When:
1st year, modules 3 and 4
Mode of studies:
offline
Open to:
students of one campus
Master’s programme:
Computational Linguistics
Language:
English
ECTS credits:
5
Contact hours:
64
Course Syllabus
Abstract
The course is devoted to modern methods of data analysis as applied to linguistic data, including statistical inference and exploratory data analysis with visualizations. We begin with the theoretical background in mathematical statistics and discuss the limitations of statistical methods and their applicability to linguistic problems. On the practical side, we use R to analyse real datasets. We also discuss different visualization techniques using the popular ggplot2 library.
Learning Objectives
Within this course you will:
- learn about the principal steps of quantitative research in linguistics;
- learn about the possibilities and limitations of quantitative approaches as applied to different research questions;
- learn to formulate research questions and develop them into testable hypotheses;
- explore the possibilities of data collection and different approaches to sampling;
- learn to evaluate the quality of a quantitative approach;
- study the most common corpus, experimental, and mixed designs in linguistic studies, and learn to evaluate research plans and to discover and prevent the associated threats to data validity;
- practise preparing your quantitative data for analysis, evaluating its quality, and treating missing data;
- learn about the possibilities and limitations of conventional statistical techniques and criteria, as well as some popular contemporary multivariate statistical methods;
- learn to choose and apply in practice a set of appropriate statistical tests for your research question.
Expected Learning Outcomes
- Master basic skills of working in R
- Learn to formulate research questions and develop them into testable hypotheses
- Account for basic types of data used in linguistic research
- Learn, apply, and critically discuss the limitations of commonly used methods for answering research questions about language
Course Contents
- Introduction to R
  Types of data. Data frames. Functions and arguments. Pipes (dplyr style). Visualizing data: base graphics and ggplot2.
- Research design and hypothesis testing. Descriptive statistics. Basic visualizations.
  Main stages of a research study. Hypothesis testing. Types of distributions. Independent and repeated observations. P-values. Exact binomial test, t-test, ANOVA. Confidence intervals. Chi-squared test and Fisher's exact test for categorical data.
- Correlation and regression
  Correlation. Linear and polynomial regression. Logistic regression.
- Mixed-effects models
  Fixed and random effects. Mixed-effects models.
- Bootstrap. Decision trees. Random forests.
- Clustering and dimensionality reduction
  Distance matrices. Clustering. Dimensionality reduction and visualisation using MDS, PCA, CA, MCA.
- Bayesian statistics
  Bayes' rule for statistical inference. (Generalized) linear models. Model comparison and selection.
Assessment Elements
- Homework assignments
- Individual research project preparation
- Oral exam
  The student is expected to prepare the final project in written form as an electronic document. The exam takes the form of an oral defence of the final project. The Exam Score measures the overall quality of the final project; it is an integer from 0 to 10. The Final Score is obtained from the following formula: Final Score = 0.6 × (Homework Score) + 0.4 × (Exam Score). The exam is conducted on Zoom, according to a schedule that will be published in the official Telegram channel of the course. The link to the Zoom meeting will be published there as well. The first retake is conducted in the same way as the exam. The second retake is conducted as a written test. Connectivity problems are not considered violations of the rules, provided they still allow the examination to be completed.
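As an illustration of the Final Score formula, the weighting can be sketched in a few lines of Python. This is only a minimal sketch: the function name `final_score` is hypothetical, the Homework Score is assumed to be on the same 0 to 10 scale as the Exam Score, and no rounding is applied because the syllabus does not state a rounding rule.

```python
def final_score(homework_score: float, exam_score: int) -> float:
    """Weighted final score: 0.6 * homework + 0.4 * exam."""
    # The syllabus states that the Exam Score is an integer from 0 to 10.
    if not isinstance(exam_score, int) or not 0 <= exam_score <= 10:
        raise ValueError("exam_score must be an integer from 0 to 10")
    # Assumption: homework_score is also on a 0-10 scale (not stated in the syllabus).
    return 0.6 * homework_score + 0.4 * exam_score


# Example: homework score 8 and exam score 7.
print(f"{final_score(8, 7):.1f}")
```

With these weights, homework contributes more than half of the result, so a strong oral defence alone cannot fully compensate for missing homework.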
Interim Assessment
- Interim assessment (module 4): 0.6 * Homework assignments + 0.4 * Individual research project preparation
Bibliography
Recommended Core Bibliography
- Gries, S. T. (2013). Statistics for Linguistics with R: A Practical Introduction (2nd revised ed.). Berlin: De Gruyter Mouton. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=604318
- Levshina, N. (2015). How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1093048
Recommended Additional Bibliography
- Gries, S. T. (2017). Quantitative Corpus Linguistics with R: A Practical Introduction (2nd ed.). Milton Park, Abingdon, Oxon: Routledge. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1386645
- Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. New York, NY: Springer. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1175341
- Harney, H. L. (2016). Bayesian Inference: Data Evaluation and Decisions (2nd ed.). Switzerland: Springer. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1301176
- McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Boca Raton: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1338291