Master's programme
2022/2023
Research Seminar "Data Scraping"
Status:
Compulsory course (Master of Computer Vision)
Field of study:
01.04.02. Applied Mathematics and Informatics
Delivered in:
2nd year, module 4
Study format:
with an online course
Online hours:
90
Open to:
students of the home campus
Instructors:
Савченко Андрей Владимирович
Degree programme:
Master of Computer Vision
Language:
English
Credits:
6
Contact hours:
6
Course Syllabus
Abstract
Data scraping is the process of importing information from websites, spreadsheets, PDFs, and other data sources. Machine learning methods do not produce good results without a well-prepared dataset, and well-prepared datasets suitable for machine learning algorithms are rare. Automating the preparation of such datasets is the task of data scraping. The course covers text file encodings, network interaction with web servers, the basics of HTML, the XML and JSON data storage and exchange formats, interaction with servers through APIs, and work with dynamic (non-static) sites. The course uses Python and its libraries to access data. At the end of the course, students implement a data scraping project.
Learning Objectives
- Learn to process Excel, XML, JSON, and PDF files using Python
- Learn the basics of IP, DNS, and HTTP, including GET and POST requests
- Learn HTML basics
- Learn to use the BeautifulSoup library and browser automation with Selenium
- Learn to use APIs
Expected Learning Outcomes
- Learn the most popular encodings
- Convert text from one encoding to another
- Navigate through JSON & XML
- Extract text and images from PDF
- Apply regular expressions
- Understand HTML
- Create a simple HTML-page
- Understand CSS
- Analyze the connection between HTML and CSS
- Create a more complicated HTML page
- Apply CSS to add style to an HTML page
- Analyze the HTTP message format
- Learn about Python web tools
- Apply the Python requests module
- Apply the Python requests module to deal with headers, user sessions, POST requests, and files
- Apply Python BeautifulSoup module to scrape static pages
- Analyze the difference between static and dynamic pages
- Understand the capabilities of the Selenium library, its functions and methods
- Apply the Selenium library to scrape data from a dynamic page
- Recognize the concept of a web API
- Contrast the process of scraping via Web-API and via page source
- Examine the process of web-development
- Create your own simple web-service & web-API
- Implement a scraping script from scratch
- Understand legal & ethical nuances of data scraping
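Several of the outcomes above (converting text between encodings, navigating JSON and XML, applying regular expressions) can be illustrated with Python's standard library alone. The following is a minimal sketch rather than course material: the `reencode` helper, the sample strings, and the field names are hypothetical and chosen only for illustration.

```python
import json
import re
import xml.etree.ElementTree as ET


def reencode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target one."""
    return data.decode(source).encode(target)


# Hypothetical sample: a Russian word stored as Windows-1251 bytes.
cp1251_bytes = "данные".encode("cp1251")
utf8_bytes = reencode(cp1251_bytes, "cp1251", "utf-8")

# Navigate a JSON document (made-up structure).
doc = json.loads('{"course": {"title": "Data Scraping", "credits": 6}}')
title = doc["course"]["title"]

# Navigate an XML document holding the same made-up content.
root = ET.fromstring(
    "<course><title>Data Scraping</title><credits>6</credits></course>"
)
credits = int(root.find("credits").text)

# Extract all four-digit numbers from free text with a regular expression.
years = re.findall(r"\b\d{4}\b", "Academic year 2022/2023")
```

Decoding to `str` first and encoding second keeps the conversion explicit; a wrong `source` value either raises `UnicodeDecodeError` or silently yields mojibake, which is why the source encoding has to be known or detected beforehand.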
Course Contents
- 1. Character Encodings
- 2. Popular File Formats
- 3. Regular Expressions and HTML
- 4. HTML and CSS
- 5. Internet
- 6. Scraping HTML
- 7. Selenium
- 8. Web API
- 9. Web development 101
- 10. Practice
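Topic 6 covers scraping static HTML, for which the course names BeautifulSoup. As a dependency-free illustration of the same idea, the sketch below uses the standard library's `html.parser` instead; this is a swapped-in technique, not the course's tooling. The `LinkCollector` class and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical static page used only for this example.
page = """
<html><body>
  <a href="/syllabus">Course <b>Syllabus</b></a>
  <a href="/contents">Course Contents</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(page)
collector.close()
```

After the run, `collector.links` holds one tuple per link. This approach only sees markup present in the page source, so content rendered by JavaScript would still require a browser-driven tool such as Selenium (topic 7).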
Bibliography
Recommended Core Bibliography
- Matt West - HTML5 Foundations - John Wiley & Sons, Incorporated, 2012-386 - Electronic text - https://ebookcentral.proquest.com/lib/hselibrary-ebooks/detail.action?docID=1120310
Recommended Additional Bibliography
- Ian Pouncey and Richard York - Beginning CSS: Cascading Style Sheets for Web Design - John Wiley & Sons, Incorporated, 2011-466 - Electronic text - https://ebookcentral.proquest.com/lib/hselibrary-ebooks/detail.action?docID=693510