
Tools & Techniques for Digital Humanities (2019W)


070112 UE Course in Methodology - Tools & Techniques for Digital Humanities (2019W) — University of Vienna, Department of History; Instructor Dr. Maxim G. Romanov


L12 Topic Modeling — with python & jupyter notebooks

Goals:

  • Introduction to topic modeling, or how to classify texts by shared content (“topics”).

Software

  • python
  • jupyter notebook
  • other python libraries
    • nltk
    • gensim
    • spacy
    • pyLDAvis
    • matplotlib
    • numpy
    • pandas
    • plotly
    • pprint
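
Before opening the workbooks, it can help to check which of the libraries listed above are already available. The following is a minimal sketch, not part of the course materials: the function name check_libraries is made up for illustration, and it only tests whether each module can be found, not whether its version is suitable. Note that pprint ships with python itself and never needs installing.

```python
import importlib.util

# Import names of the libraries listed above.
LIBRARIES = ["nltk", "gensim", "spacy", "pyLDAvis", "matplotlib",
             "numpy", "pandas", "plotly", "pprint"]


def check_libraries(names):
    """Return {name: True/False}: True if python can locate the module."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Print a short report; anything marked MISSING still needs installing.
for name, found in check_libraries(LIBRARIES).items():
    print(f"{name:12} {'ok' if found else 'MISSING'}")
```

Anything reported as MISSING can then be installed with the commands in the installation sections.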

Workbooks (jupyter notebooks)

Installing: on Windows

On Mac and Linux installation is straightforward: just follow the commands below. On Windows it is trickier, and the easiest way is to use Anaconda: https://www.anaconda.com/distribution/#download-section.

Download and install it. Most packages come with the Anaconda distribution; others can be installed through its interface.

NB: Even after Anaconda is installed, it is still better to install libraries from a terminal opened directly from Anaconda. For gensim, for example, use the following command (the latest version is not available via the Anaconda interface):

conda install -c conda-forge gensim

More details: https://radimrehurek.com/gensim/install.html

Installing: on Mac and Linux

python libraries and additional data


pip install nameOfLibrary

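Language data for lemmatization with spacy (although we are not going to be using it in the tutorial)

python -m spacy download en

(With spacy 3 and later the short name en is no longer accepted; use python -m spacy download en_core_web_sm instead.)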

jupyter notebook

From command line (in your working folder)


# installing
pip install jupyter

# starting
jupyter notebook

installing from a jupyter notebook

The required libraries can also be installed directly from your Jupyter notebook as shown below. Note the ! in front of pip. (You might need to use pip3 instead of pip, depending on your overall python setup.)

# installing

!pip install nltk
!pip install gensim
!pip install spacy
!pip install pyLDAvis
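
If pip and pip3 point to a different python than the one running your notebook, the libraries may install successfully and still fail to import. One common workaround, sketched here as a suggestion rather than as part of the original tutorial, is to call pip through the interpreter that is running the notebook. The helper names pip_install_command and install are made up for this example.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this notebook."""
    return [sys.executable, "-m", "pip", "install", package]


def install(package):
    """Run pip for the current interpreter; raises CalledProcessError on failure."""
    subprocess.check_call(pip_install_command(package))


# Show the command that would be run (nothing is installed by this line).
print(" ".join(pip_install_command("gensim")))

# Uncomment to actually install the libraries:
# for name in ["nltk", "gensim", "spacy", "pyLDAvis"]:
#     install(name)
```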

After you start jupyter notebook from the command line, your default browser should open a page listing the files in your working folder.

Click on an *.ipynb file to open a notebook.

Files & Scripts

Class

  • Basic explanations
  • Hands-on tutorial

Topics

Example 1

Thursday, March 27, 1862

LIGHT ARTILLERY

—I am authorized by the Governor of Virginia to raise a Company of Light Artillery for the war. All those desirous of enlisting in this the most effective arm of the service, would do well to call at once at the office of Johnson & Guigon, Whig Building.

Uniforms and subsistence furnished.

A. B. GUIGON. mh 25—6t

Example 2

Wednesday, August 17, 1864

Royal Marriages.

—There is a story circulated in Germany, and some in Paris, that the match between the heir-apparent of the Imperial throne of Russia and the Princess Dagmar of Denmark having been definitively broken off, another is in the course of negotiation between His Imperial Highness and the Princess Helens of England.

Example 3

Monday, June 22, 1863

NEWS FROM EUROPE.

The steamship Scotia arrived at New York on Thursday from Europe, with foreign news to the 7th inst. The news is not important. The Confederate steamer Lord Clyde was searched by order of the British Government, but nothing contraband being found on board her she was permitted to sail. The Russians have been defeated near Grochoury by the Polish insurgents. The three Powers have sent an earnest note to Russia, asking for a representative Government, a general amnesty, and an immediate cessation of hostilities in Poland.
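
Topic models start from word counts: texts that share vocabulary are more likely to end up under the same topic. The sketch below is not a topic model and is not taken from the course notebook. It only builds the word-count ("bag of words") representation for one sentence from each example above and lists the content words two examples share. The stopword list is a tiny hand-picked assumption, and the function names are made up for illustration.

```python
import re
from collections import Counter

# A tiny hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {"the", "of", "a", "to", "and", "in", "is", "by", "for", "at",
             "on", "with", "from", "an", "has", "have", "been", "was", "that"}


def bag_of_words(text):
    """Lowercase, keep alphabetic tokens, drop stopwords and tokens under 3 letters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)


def shared_words(text_a, text_b):
    """Return the set of content words that occur in both texts."""
    return set(bag_of_words(text_a)) & set(bag_of_words(text_b))


# One sentence (or part of one) from each example above.
docs = {
    "example1": "I am authorized by the Governor of Virginia to raise "
                "a Company of Light Artillery for the war",
    "example2": "the match between the heir-apparent of the Imperial throne "
                "of Russia and the Princess Dagmar of Denmark",
    "example3": "The three Powers have sent an earnest note to Russia, "
                "asking for a representative Government",
}

print(shared_words(docs["example2"], docs["example3"]))  # {'russia'}
print(shared_words(docs["example1"], docs["example2"]))  # set()
```

Even on these short excerpts, Examples 2 and 3 share a word about European affairs while Example 1 shares none with Example 2. A topic model finds such patterns statistically across the whole corpus.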

Reference Materials:

Homework:

  1. Topic modeling the “Dispatch”: using the provided jupyter notebook, run topic modeling on the “Dispatch”.
  2. Change the number of topics to 30 and compare the new results with the results for 40 topics. Record your observations.
  3. Publish your observations as a blogpost on your website; compare your results with those of Rob Nelson’s Mining the Dispatch (http://dsl.richmond.edu/dispatch/).
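
One possible way to organize the comparison between the two runs is to measure how much the top words of each topic overlap. The sketch below uses the Jaccard index (shared words divided by all words) to pair each topic from one run with its closest counterpart in the other. This is a suggestion, not a required method. The function names are made up, and the word lists are invented for illustration; they are not real results from the "Dispatch". In practice you would fill them with the top words your notebook reports for each topic.

```python
def jaccard(words_a, words_b):
    """Jaccard index of two word lists: size of intersection over size of union."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_matches(topics_a, topics_b):
    """For each topic in topics_a, find the most similar topic in topics_b."""
    matches = {}
    for id_a, words_a in topics_a.items():
        id_b = max(topics_b, key=lambda k: jaccard(words_a, topics_b[k]))
        matches[id_a] = (id_b, round(jaccard(words_a, topics_b[id_b]), 2))
    return matches


# Invented top words per topic, for illustration only.
run_30 = {0: ["artillery", "company", "enlist", "war"],
          1: ["russia", "europe", "steamer", "news"]}
run_40 = {0: ["russia", "poland", "europe", "news"],
          1: ["company", "artillery", "uniforms", "war"]}

for topic_id, (match_id, score) in best_matches(run_30, run_40).items():
    print(f"30-topic run, topic {topic_id} ~ 40-topic run, topic {match_id} "
          f"(overlap {score})")
```

Topics with a high overlap are stable across the two settings; topics with a low best overlap are the ones that split, merged, or disappeared, and are worth describing in your observations.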

Submitting homework:

  • Homework assignments must be submitted by the beginning of the next class.
  • Email your homework to the instructor.
    • If your homework is to create a file, email it as an attachment.
    • If your homework is a blogpost on your website, email the link to your website and to the blogpost with your homework.
    • In the subject of your email, please use the following: 070112-LXX-HW-YourLastName-YourMatriculationNumber, where LXX is the lesson for which the homework is submitted, YourLastName is your last name, and YourMatriculationNumber is your matriculation number.