Tools & Techniques for Digital Humanities (2019W)


070112 UE Course in Methodology - Tools & Techniques for Digital Humanities (2019W) — University of Vienna, Department of History; Instructor Dr. Maxim G. Romanov


L06 Webscraping — Getting to know WGET

Goals:

Introduction to web scraping: how to efficiently collect large amounts of information from the Internet.

Software:

  • wget (https://www.gnu.org/software/wget/), a free software package for retrieving files using HTTP, HTTPS, FTP and FTPS, the most widely used Internet protocols. It is a non-interactive command-line tool, so it can easily be called from scripts, cron jobs, terminals without X-Windows support, etc.

  • NB: on installing wget:

    • On Windows (the easiest): download from https://eternallybored.org/misc/wget/ and choose the latest 64-bit ZIP file (the EXE will most likely be blocked by your browser as a potentially dangerous file).
      • Unzip the file and copy wget.exe into the folder where you plan to scrape data. NB: the easiest approach on Windows is to copy this file into each folder from which you want to run wget.
    • On Mac (with Homebrew installed): brew install wget. On Linux, wget is usually preinstalled; if it is not, it can be installed with the distribution's package manager.
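
After installation, it can help to confirm that the shell actually finds wget. The following is a minimal sketch, assuming a POSIX-style shell (Terminal on Mac or Linux, or something like Git Bash on Windows); the variable name wget_status is only illustrative.

```shell
# Check whether wget is reachable from the current shell.
if command -v wget >/dev/null 2>&1; then
    wget_status="installed"
    wget --version | head -n 1   # the first line reports the version
else
    wget_status="missing"
    echo "wget was not found on the PATH" >&2
fi
echo "wget is $wget_status"
```

If the result is "missing" on Windows, the likely cause is that wget.exe is not in the folder from which the command is run.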

Class:

  • practical examples of working with wget
  • single link download
  • batch download
    • web-page analysis
    • extraction of links with regular expressions
    • modification of links with regular expressions
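
The extraction and modification of links listed above can be sketched on the command line as follows. This is only an illustration under stated assumptions: the page sample_page.html, the example.org addresses, and the "view" and "download" link patterns are all hypothetical, and in class the same regular expressions may just as well be applied in a text editor.

```shell
# Hypothetical sample page; a real page would be saved from the site being scraped.
cat > sample_page.html <<'EOF'
<html><body>
<a href="http://example.org/issues/view?id=1861-01-01">1 Jan 1861</a>
<a href="http://example.org/issues/view?id=1861-01-02">2 Jan 1861</a>
<a href="http://example.org/about.html">About</a>
</body></html>
EOF

# 1. Extraction: keep only the href values that point to issues,
#    then strip the surrounding href="..." wrapper.
grep -oE 'href="[^"]*view\?id=[^"]*"' sample_page.html \
  | sed -E 's/^href="//; s/"$//' > links_view.txt

# 2. Modification: rewrite each "view" link into a (hypothetical)
#    direct "download" link, reusing the captured date.
sed -E 's#/view\?id=([0-9-]+)$#/download/\1.xml#' links_view.txt > links.txt

# Show the final list, one link per line.
cat links.txt
```

The resulting links.txt contains one link per line, which is the form that wget -i expects.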

Sample commands

wget link
wget -i file_with_links.txt
wget -i links.txt -P ./folderYouWantToSaveTo/ -nc 

Where:

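  • -i is the input-file parameter, which tells wget to read the links from a text file, one link per line.
  • -P is the folder (prefix) parameter, which tells wget where to store the downloaded files (optional).
  • -nc is the no-clobber parameter, which tells wget to skip files that already exist (optional).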

NB: there are many other parameters with which you can adjust wget to your needs.
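
As an illustration of such additional parameters, here is a minimal sketch of a more "polite" batch download that pauses between requests and caps the download speed. The function name polite_download and the specific values (2 seconds, 200k, 3 tries) are assumptions chosen for the example, not course requirements.

```shell
# Wrap a "polite" batch download in a function; nothing is downloaded
# until the function is actually called.
polite_download() {
    links_file="$1"    # text file with one URL per line
    target_dir="$2"    # folder for the downloaded files
    # -i                 read the URLs from a file
    # -P                 save the files into this folder
    # -nc                skip files that already exist
    # --wait=2           pause 2 seconds between requests
    # --random-wait      vary that pause randomly around the given value
    # --limit-rate=200k  cap the download speed at about 200 KB/s
    # --tries=3          give up on a file after 3 attempts
    wget -i "$links_file" -P "$target_dir" -nc \
         --wait=2 --random-wait --limit-rate=200k --tries=3
}

# Example call (commented out so that the sketch has no side effects):
# polite_download links.txt ./folderYouWantToSaveTo/
```

Pausing between requests and limiting the rate reduces the load on the server, which matters when a list contains hundreds or thousands of links.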

Examples for Downloading

Practice 1: very easy

Practice 2: easy-ish

Practice 3 (aka Homework): a tiny-bit tricky

Reference Materials:

Homework:

  1. Scraping the “Dispatch”: download the issues of the “Richmond Times Dispatch” (years 1860-1865 only!), which are available at: http://www.perseus.tufts.edu/hopper/collection?collection=Perseus:collection:RichTimes
  2. Publish a step-by-step explanation of what you have done as a blogpost on your website.
  3. Codecademy’s Learn Python, Units 4-5.
  4. GitHub: publish the confirmation screenshot as a post on your new site.

Submitting homework:

  • Homework assignments must be submitted by the beginning of the next class.
  • Email your homework to the instructor.
    • If your homework is to create a file, email it as an attachment.
    • If your homework is a blogpost on your website, email the link to your website and to the blogpost with your homework.
    • In the subject line of your email, please use the following format: 070112-LXX-HW-YourLastName-YourMatriculationNumber, where LXX is the lesson for which the homework is submitted, YourLastName is your last name, and YourMatriculationNumber is your matriculation number.