Goals:
Introduction to web scraping, or how to efficiently collect large amounts of information from the Internet.
Software:
- wget (https://www.gnu.org/software/wget/), a free software package for retrieving files using HTTP, HTTPS, FTP and FTPS, the most widely used Internet protocols. It is a non-interactive command-line tool, so it can easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
- NB: on installing wget:
  - On Windows (the easiest): download from https://eternallybored.org/misc/wget/ > choose the latest 64-bit ZIP file (the EXE will most likely be blocked by your browser as a potentially dangerous file).
    - Unzip the file and copy wget.exe to the folder where you are planning to scrape data; NB: the easiest approach on Windows is to move/copy this file into relevant folders.
  - On Mac (and, possibly, Linux): brew install wget
Class:
- practical examples of working with wget
- single link download
- batch download
- web-page analysis
- extraction of links with regular expressions
- modification of links with regular expressions
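The last two items can be sketched in the shell with grep and sed. This is only an illustration under stated assumptions: the HTML fragment, the example.org addresses, and the "view" to "download" rewrite are all hypothetical stand-ins for a real saved web page, and the patterns would need to be adapted to the site you are actually working with.

```shell
#!/bin/sh
# Hypothetical HTML fragment standing in for a saved web page.
html='<ul>
<li><a href="http://example.org/articles/view?id=101">Article 01</a></li>
<li><a href="http://example.org/articles/view?id=102">Article 02</a></li>
<li><a href="/about">About</a></li>
</ul>'

# 1. Extraction: print only the text that matches the pattern
#    (-o prints the matches alone, -E enables extended regular expressions).
links=$(printf '%s\n' "$html" | grep -oE 'http://example\.org/articles/view\?id=[0-9]+')

# 2. Modification: rewrite each "view" link into a (hypothetical) "download" link,
#    reusing the captured article number as \1.
modified=$(printf '%s\n' "$links" | sed -E 's|/view\?id=([0-9]+)|/download/\1.xml|')

# Save one link per line, the format that wget -i expects.
printf '%s\n' "$modified" > links.txt
```

The resulting links.txt holds one rewritten link per line; the "/about" link is dropped because it does not match the extraction pattern.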
Sample commands
wget link
wget -i file_with_links.txt
wget -i links.txt -P ./folderYouWantToSaveTo/ -nc
Where:
- -P is a folder parameter, which tells wget where to store downloaded files (optional).
- -nc is a no-clobber parameter, which tells wget to skip files that already exist (optional).
NB: there are many other parameters with which you can adjust wget to your needs.
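As one illustration of such further parameters, the sketch below wraps the batch command in a small shell function and adds three standard wget options that slow the download down, which is considerate toward the server when fetching many files. The function name polite_download and the specific values (2 seconds, 200k) are assumptions chosen for the example, not recommendations from this lesson.

```shell
#!/bin/sh
# polite_download is a hypothetical helper name, not part of wget.
# $1: text file with one link per line; $2: folder to save into.
polite_download() {
    # -i: read links from the file; -P: target folder; -nc: skip existing files
    # --wait=2: pause 2 seconds between downloads
    # --random-wait: vary that pause so requests are less uniform
    # --limit-rate=200k: cap the download speed at about 200 KB/s
    wget -i "$1" -P "$2" -nc --wait=2 --random-wait --limit-rate=200k
}
```

It would be called as polite_download links.txt ./folderYouWantToSaveTo/ from the folder that contains links.txt.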
Examples for Downloading
Practice 1: very easy
- Article 01
- Article 02
- Article 03
- Article 04
- Article 05
- Article 06
- Article 07
- Article 08
- Article 09
- Article 10
- Article 11
- Article 12
- Article 13
- Article 14
- Article 15
Practice 2: easy-ish
- Article 16
- Article 17
- Article 18
- Article 19
- Article 20
- Article 21
- Article 22
- Article 23
- Article 24
- Article 25
- Article 26
- Article 27
- Article 28
- Article 29
- Article 30
- Article 31
- Article 32
- Article 33
- Article 34
- Article 35
- Article 36
- Article 37
- Article 38
- Article 39
Practice 3 (aka Homework): a tiny-bit tricky
- download issues of the “Richmond Times Dispatch” (years 1860-1865 only!), which are available at: http://www.perseus.tufts.edu/hopper/collection?collection=Perseus:collection:RichTimes
Reference Materials:
- Milligan, Ian. 2012. “Automated Downloading with Wget.” Programming Historian, June. https://programminghistorian.org/lessons/automated-downloading-with-wget.
- Kurschinski, Kellen. 2013. “Applied Archival Downloading with Wget.” Programming Historian, September. https://programminghistorian.org/lessons/applied-archival-downloading-with-wget.
- Baxter, Richard. 2019. “How to download your website using WGET for Windows.” https://builtvisible.com/download-your-website-with-wget/.
- Alternatively, this operation can be done with a Python script: Turkel, William J., and Adam Crymble. 2012. “Downloading Web Pages with Python.” Programming Historian, July. https://programminghistorian.org/lessons/working-with-web-pages.
Homework:
- Scraping the “Dispatch”: download issues of the “Richmond Times Dispatch” (years 1860-1865 only!), which are available at: http://www.perseus.tufts.edu/hopper/collection?collection=Perseus:collection:RichTimes
- Publish a step-by-step explanation of what you have done as a blogpost on your website.
- Codecademy’s Learn Python, Units 4-5.
- GitHub: publish the confirmation screenshot as a post on your new site.
Submitting homework:
- Homework assignments must be submitted by the beginning of the next class.
- Email your homework to the instructor:
  - if your homework is to create a file, email it as an attachment;
  - if your homework is a blogpost on your website, email the link to your website and to the blogpost with your homework.
- In the subject of your email, please add the following: 070112-LXX-HW-YourLastName-YourMatriculationNumber, where LXX is the lesson for which the homework is submitted, YourLastName is your last name, and YourMatriculationNumber is your matriculation number.