Profile Image

Tools & Techniques for Digital Humanities (2019W)


070112 UE Course in Methodology - Tools & Techniques for Digital Humanities (2019W) — University of Vienna, Department of History; Instructor Dr. Maxim G. Romanov


L10 Text to Map (1/2) — Extracting & Preparing Geographical Data

Goals:

To learn how to map quantifiable data, for example, mentions of different geographical locations in a given text. The first step will involve: learning about techniques for 1) extracting geographical data from texts; and about 2) preparing necessary data for mapping.

Examples of the final work:

Data needed

  • Placename - to add a label to a map
  • Lat,Lon - to place a dot on a map
  • Frequency - to size a dot on a map
  • TimeParameter - if we want to visualize change over time

Software & Technologies:

  • python (matching, reformatting, calculating frequencies)

Class:

Basic techniqies with for extracting geographical data usually include:

  1. HARD: NER (named-entity recognition): matching against a gazetteer (a dictionary of geographical names, or toponyms) to identify potential toponyms in a text; additional NLP (natural language processing) may be used to disambiguate matches.
  2. EASY-ish: Extracting toponyms from pre-tagged texts (our case)

Step 1. Extracting relevant data

Example:


<milestone unit="sentence" n="57" />It is my firm opinion,
substantiated by that of my family physician's, that
<persName n="Baker,,,,," id="n-0001.0000.00000.00014"
reg="nearbymention:Baker,E.,E.,," authname="baker,e.,e.">
<surname full="yes">Baker</surname></persName>'s Premium
Bitters is the best medicine now before the public for the
above mentioned diseases. </p> <closer> <salute>Yours,
most truly,</salute> <signed><persName n="Quarles,,P.,W.,J.,"
id="n-0001.0000.00000.00015" reg="default:Quarles,P.,W.,J.,"
authname="quarles,p.,w.,j."><foreName full="yes">P.</foreName>
<foreName full="yes">W.</foreName>  <foreName full="yes">J.</foreName>
<surname full="yes">Quarles</surname></persName>.</signed>
<lb /> These Bitters can be had of all the <name>Druggists</name>
in this city and at every respectable <placeName reg="Drug Store">Drug
Store</placeName> in the <rs>State</rs>. Orders filled promptly by
addressing <lb /> <persName n="Baker,,E.,,," id="n-0001.0000.00000.00016"
reg="default:Baker,E.,,," authname="baker,e."><foreName full="yes">
E.</foreName> <surname full="yes">Baker</surname></persName>,
Proprietor, <lb /> <placeName reg="Richmond, Richmond, Virginia"
key="tgn,7013964" authname="tgn,7013964">Richmond, Va.</placeName>
<lb /> oc <num value="30">30</num>--ts </closer> </div4>
</div3> 

<div3 type="advert" n="4" org="uniform" sample="complete">
<head><dateStruct value="1860--" full="yes" authname="1860">
<year reg="1860" full="yes">1860</year></dateStruct>.--Fall
and <rs n="winter goods" type="product">Winter Goods</rs>.
--<dateStruct value="1860--" full="yes" authname="1860">
<year reg="1860" full="yes">1860</year></dateStruct>.
<lb />Great Rush for Bargains.<lb />Tremendous Excitement.</head> 

Toward a Solution…

  • Our general script can be modified accordingly: instead of removing XML tags, we can extract relevant information and create another CSV file, of the following format: Placename, Frequency
  • Toponyms are described as <placeName reg="Richmond, Richmond, Virginia" key="tgn,7013964" authname="tgn,7013964">Richmond, Va.</placeName>
    • Not all <placeName ....> are the placenames that we need
    • We can use strings like authname="tgn,7013964" or key="tgn,7013964" to check if a result is a toponym; NB: try to figure out what tgn is and whether this can be helpful in any way.
    • What should go into Placename?
  • Frequencies can be calculated with a dictionary, where keys are Placename and values are Frequency.
dicFreq = {}

for i in listOfResults:
    # some additional code may be required here
    #   you might also want to filter out items of certain type:
    #   for example, exclude advertisement (which is likely to be local only)
    if i in dicFreq:
        dicFreq[i] += 1
    else:
        dicFreq[i]  = 1

  • Dictionary of frequencies can be then reformatted into a frequency list, which can be sorted by frequencies; one of the common techniques is usually to exclude low frequency values (for example those with frequencies 1; alternatively, relevant threshhold can be determined through a descriptive statistics method).

resultsCSV = []

for key, value in dicFreq.items():
    if value > 1: # this will exclude items with frequency 1
        newVal = "%09d\t%s" % (value, key)
        # newVal will looks like: `000005486 TAB Richmond`
        resultsCSV.append(newVal)

resultsCSV = sorted(resultsCSV, reverse=True)
print(len(resultsCSV)) # will print out the number of items in the list
resultsToSave = "\n".join(resultsCSV)

  • Additional: Try to include all items and check how the results will differ (in terms of the number of toponyms/placenames). You can use print(len(resultsCSV)) to get a quick count of results in the resultsCSV list.

NB: for the final map we will need also time parameters, so that we could visualize our data chronologically.

Problem

Although we can collect data, it is not yet mappable. We need to create relevant data that would allow us to place our frequency information on a map. We need coordinates!

Step 2. Data (coordinates)

The next step is to find all relevant coordinates. We can write a script that will match our data vs. the data from Geonames.org. This can be done as follows:

  • create a python dictionary (I) from our frequency data, where keys are placenames, and values—the entire TSV record.
  • create another python dictionary (II) from Geonames data, where keys are also placenames and values—the entire TSV record; there are likely to be multiple records with the same placenames—how would you deal with those? (Hint: you do not want to overwrite one with another; in order to avoid that, you can save values as lists, and append new records if there are more than one).
  • NB: you will need to consult http://download.geonames.org/export/dump/readme.txt in order to figure out what is what in that data file.
  • now, you can loop through python dictionary (I) and check if a key (I) exists in python dictionary (II); if yes, than you can add records to a new variable where you would collect only relevant toponyms; you can then save these selected records and manually re-check them.

Alternative ways of collecting coordinates

  • If you have a map of relevant area, you can georeference it and harvest coordinates from your georeferenced map; you can then also visualize your data on that very map, which might be a very nice touch visually.
  • As a last resort, you can always manually collect required coordinates, which, however, is the most time-consuming of doing this.

Reference Materials:

Homework (1/2 and 2/2):

  1. GISting the “Dispatch” II: Mapping geographical data from the “Dispatch”
    • Extract toponyms (place names) from the “Dispatch” (python)
    • Calculate their frequencies (python)
    • Add coordinates to those places (QGIS/python)
      • Hint: you can create two lists 1) one with place names and their frequencies, 2) with place names and their coordinates; after that, you can merge them with python and generate a CSV file (placename, timeParameter, latitude, longitude, frequency) which can be used for creating a map in QGIS
    • Map them in QGIS, using frequencies to size the markers on the map
  2. Describe the process in a blogpost and publish on your website (including screenshots).

Submitting homework:

  • Homework assignment must be submitted by the beginning of the next class;
  • Email your homework to the instructor.
    • if your homework is to create a file, email it as an attachment
    • if your homework is a blogpost on your website, email the link to your website and to the blogpost with your homework.
    • In the subject of your email, please, add the following: 070112-LXX-HW-YourLastName-YourMatriculationNumber, where LXX is the lesson for which the homework is submitted, YourLastName is your last name, and YourMatriculationNumber is your matriculation number.