Goals:
To learn how to map quantifiable data, for example, mentions of different geographical locations in a given text. The first step will involve: learning about techniques for 1) extracting geographical data from texts; and about 2) preparing necessary data for mapping.
Examples of the final work:
- https://maximromanov.github.io/2015/04-02.html
- https://maximromanov.github.io/2014/08-23.html (VIDEO)
Data needed
- Placename - to add a label to a map
- Lat,Lon - to place a dot on a map
- Frequency - to size a dot on a map
- TimeParameter - if we want to visualize change over time
Software & Technologies:
python
(matching, reformatting, calculating frequencies)
Class:
Basic techniqies with for extracting geographical data usually include:
- HARD: NER (named-entity recognition): matching against a gazetteer (a dictionary of geographical names, or toponyms) to identify potential toponyms in a text; additional NLP (natural language processing) may be used to disambiguate matches.
- EASY-ish: Extracting toponyms from pre-tagged texts (our case)
Step 1. Extracting relevant data
Example:
<milestone unit="sentence" n="57" />It is my firm opinion,
substantiated by that of my family physician's, that
<persName n="Baker,,,,," id="n-0001.0000.00000.00014"
reg="nearbymention:Baker,E.,E.,," authname="baker,e.,e.">
<surname full="yes">Baker</surname></persName>'s Premium
Bitters is the best medicine now before the public for the
above mentioned diseases. </p> <closer> <salute>Yours,
most truly,</salute> <signed><persName n="Quarles,,P.,W.,J.,"
id="n-0001.0000.00000.00015" reg="default:Quarles,P.,W.,J.,"
authname="quarles,p.,w.,j."><foreName full="yes">P.</foreName>
<foreName full="yes">W.</foreName> <foreName full="yes">J.</foreName>
<surname full="yes">Quarles</surname></persName>.</signed>
<lb /> These Bitters can be had of all the <name>Druggists</name>
in this city and at every respectable <placeName reg="Drug Store">Drug
Store</placeName> in the <rs>State</rs>. Orders filled promptly by
addressing <lb /> <persName n="Baker,,E.,,," id="n-0001.0000.00000.00016"
reg="default:Baker,E.,,," authname="baker,e."><foreName full="yes">
E.</foreName> <surname full="yes">Baker</surname></persName>,
Proprietor, <lb /> <placeName reg="Richmond, Richmond, Virginia"
key="tgn,7013964" authname="tgn,7013964">Richmond, Va.</placeName>
<lb /> oc <num value="30">30</num>--ts </closer> </div4>
</div3>
<div3 type="advert" n="4" org="uniform" sample="complete">
<head><dateStruct value="1860--" full="yes" authname="1860">
<year reg="1860" full="yes">1860</year></dateStruct>.--Fall
and <rs n="winter goods" type="product">Winter Goods</rs>.
--<dateStruct value="1860--" full="yes" authname="1860">
<year reg="1860" full="yes">1860</year></dateStruct>.
<lb />Great Rush for Bargains.<lb />Tremendous Excitement.</head>
Toward a Solution…
- Our general script can be modified accordingly: instead of removing XML tags, we can extract relevant information and create another CSV file, of the following format:
Placename, Frequency
- Toponyms are described as
<placeName reg="Richmond, Richmond, Virginia" key="tgn,7013964" authname="tgn,7013964">Richmond, Va.</placeName>
- Not all
<placeName ....>
are the placenames that we need - We can use strings like
authname="tgn,7013964"
orkey="tgn,7013964"
to check if a result is a toponym; NB: try to figure out whattgn
is and whether this can be helpful in any way. - What should go into
Placename
?
- Not all
- Frequencies can be calculated with a
dictionary
, where keys arePlacename
and values areFrequency
.
dicFreq = {}
for i in listOfResults:
# some additional code may be required here
# you might also want to filter out items of certain type:
# for example, exclude advertisement (which is likely to be local only)
if i in dicFreq:
dicFreq[i] += 1
else:
dicFreq[i] = 1
- Dictionary of frequencies can be then reformatted into a frequency list, which can be sorted by frequencies; one of the common techniques is usually to exclude low frequency values (for example those with frequencies 1; alternatively, relevant threshhold can be determined through a descriptive statistics method).
resultsCSV = []
for key, value in dicFreq.items():
if value > 1: # this will exclude items with frequency 1
newVal = "%09d\t%s" % (value, key)
# newVal will looks like: `000005486 TAB Richmond`
resultsCSV.append(newVal)
resultsCSV = sorted(resultsCSV, reverse=True)
print(len(resultsCSV)) # will print out the number of items in the list
resultsToSave = "\n".join(resultsCSV)
- Additional: Try to include all items and check how the results will differ (in terms of the number of toponyms/placenames). You can use
print(len(resultsCSV))
to get a quick count of results in theresultsCSV
list.
NB: for the final map we will need also time parameters, so that we could visualize our data chronologically.
Problem
Although we can collect data, it is not yet mappable. We need to create relevant data that would allow us to place our frequency information on a map. We need coordinates!
Step 2. Data (coordinates)
- Download
US.zip
from Geonames.org (http://download.geonames.org/export/dump/). - Important description of the data can be found here: http://download.geonames.org/export/dump/readme.txt
The next step is to find all relevant coordinates. We can write a script that will match our data vs. the data from Geonames.org. This can be done as follows:
- create a
python dictionary (I)
from our frequency data, wherekeys
are placenames, andvalues
—the entireTSV
record. - create another
python dictionary (II)
from Geonames data, wherekeys
are also placenames andvalues
—the entireTSV
record; there are likely to be multiple records with the same placenames—how would you deal with those? (Hint: you do not want to overwrite one with another; in order to avoid that, you can savevalues
aslists
, and append new records if there are more than one). - NB: you will need to consult http://download.geonames.org/export/dump/readme.txt in order to figure out what is what in that data file.
- now, you can loop through
python dictionary (I)
and check if akey (I)
exists inpython dictionary (II)
; if yes, than you can add records to a new variable where you would collect only relevant toponyms; you can then save these selected records and manually re-check them.
Alternative ways of collecting coordinates
- If you have a map of relevant area, you can georeference it and harvest coordinates from your georeferenced map; you can then also visualize your data on that very map, which might be a very nice touch visually.
- As a last resort, you can always manually collect required coordinates, which, however, is the most time-consuming of doing this.
Reference Materials:
- Frequency list: Turkel, William J., and Adam Crymble. 2012. “Counting Word Frequencies with Python.” Programming Historian, July. https://programminghistorian.org/lessons/counting-frequencies.
- Creating cartograms with R: https://maximromanov.github.io/2015/04-02.html
- D.Mimno, A. Jones, G. Crane. Finding a Catalog, https://www.academia.edu/158175/PDF
Homework (1/2 and 2/2):
- GISting the “Dispatch” II: Mapping geographical data from the “Dispatch”
- Extract toponyms (place names) from the “Dispatch” (python)
- Calculate their frequencies (python)
- Add coordinates to those places (QGIS/python)
- Hint: you can create two lists 1) one with place names and their frequencies, 2) with place names and their coordinates; after that, you can merge them with python and generate a CSV file (placename, timeParameter, latitude, longitude, frequency) which can be used for creating a map in QGIS
- Map them in QGIS, using frequencies to size the markers on the map
- Describe the process in a blogpost and publish on your website (including screenshots).
Submitting homework:
- Homework assignment must be submitted by the beginning of the next class;
- Email your homework to the instructor.
- if your homework is to create a file, email it as an attachment
- if your homework is a blogpost on your website, email the link to your website and to the blogpost with your homework.
- In the subject of your email, please, add the following:
070112-LXX-HW-YourLastName-YourMatriculationNumber
, whereLXX
is the lesson for which the homework is submitted,YourLastName
is your last name, andYourMatriculationNumber
is your matriculation number.