Goals:
- To prepare data for mapping (final step)
- Do simple mapping in
QGIS
Solutions to Text to Map 1/1
Collecting all toponyms: using TGN numbers
import re, os
source = "path_where_initial_xml_files_are"
target = "path_to_save_new_files"
lof = os.listdir(source)
counter = 0 # general counter to keep track of the progress
def generate(filter):
topCountDictionary = {}
print(filter)
counter = 0
for f in lof:
if f.startswith("dltext"): # fileName test
with open(source + f, "r", encoding="utf8") as f1:
text = f1.read()
text = text.replace("&", "&")
# try to find the date
date = re.search(r'<date value="([\d-]+)"', text).group(1)
#print(date)
if date.startswith(filter):
for tg in re.findall(r"(tgn,\d+)", text):
tgn = tg.split(",")[1]
if tgn in topCountDictionary:
topCountDictionary[tgn] += 1
else:
topCountDictionary[tgn] = 1
#input(topCountDictionary)
top_TSV = []
for k,v in topCountDictionary.items():
val = "%09d\t%s" % (int(v/2), k)
# `tgn,XXXXX` occurs twice for every tagged toponym
# NB: unfortunately, there are some toponyms/placenames that are tagged,
# but do not have tgn identifiers --- these will be missed;
# # how can these be accounted for?
top_TSV.append(val)
# saving
header = "freq\ttgn\n"
with open("dispatch_toponyms_%s.tsv" % filter, "w", encoding="utf8") as f9:
f9.write(header+"\n".join(top_TSV))
#print(counter)
#generate("186") # - this will aggregate all data
generate("1861")
generate("1862")
generate("1863")
generate("1864")
generate("1865")
Question to ponder: How can this data be improved? (Not the quality of the initial data, but the structure of what you reformat.)
Reformating Getty Geographical Thesaurus (from XML)
NB: All files can be downloaded from here: http://tgndownloads.getty.edu/default.aspx.
Example of an XML record:
<Subject Subject_ID="7013964">
<Parent_Relationships>
<Preferred_Parent>
<Parent_Subject_ID>7022211</Parent_Subject_ID>
<Relationship_Type>Parent/Child</Relationship_Type>
<Historic_Flag>Current</Historic_Flag>
<Parent_String>Richmond Indep. City (independent city) [7022211], Virginia (state) [7007919], United States (nation) [7012149], North and Central America (continent) [1000001], World (facet) [7029392]</Parent_String>
<Hier_Rel_Type>Whole/Part-BTP</Hier_Rel_Type>
</Preferred_Parent>
</Parent_Relationships>
<Descriptive_Notes>
<Descriptive_Note>
<Note_Text>At head of James River; long impotant as tobacco center; site of trading post 1637 & Fort Charles 1644; pillaged by British during Revolutionary War & by Union forces during Civil War; State Capitol building (begun in 1795) designed by Thomas Jefferson.</Note_Text>
<Note_Language>English</Note_Language>
<Note_Contributors>
<Note_Contributor>
<Contributor_id>10000000/VP</Contributor_id>
</Note_Contributor>
</Note_Contributors>
<Note_Sources>
<Note_Source>
<Source>
<Source_ID>9006382/Encyclopaedia Britannica (1985)</Source_ID>
</Source>
</Note_Source>
</Note_Sources>
</Descriptive_Note>
</Descriptive_Notes>
<Record_Type>Administrative</Record_Type>
<Merged_Status>Merged</Merged_Status>
<Terms>
<Preferred_Term>
<Term_Text>Richmond</Term_Text>
<Display_Name>N/A</Display_Name>
<Historic_Flag>Current</Historic_Flag>
<Vernacular>Vernacular</Vernacular>
<Other_Flags>N/A</Other_Flags>
<Term_ID>298597</Term_ID>
<Term_Languages>
<Term_Language>
<Language>70001/undetermined</Language>
<Preferred>Non Preferred</Preferred>
<Qualifier>
</Qualifier>
<Term_Type>N/A</Term_Type>
<Part_of_Speech>Noun</Part_of_Speech>
<Lang_Stat>Undetermined</Lang_Stat>
</Term_Language>
</Term_Languages>
<Term_Contributors>
<Term_Contributor>
<Contributor_id>10000000/VP</Contributor_id>
<Preferred>Preferred</Preferred>
</Term_Contributor>
<Term_Contributor>
<Contributor_id>10000001/BHA</Contributor_id>
<Preferred>Non Preferred</Preferred>
</Term_Contributor>
<Term_Contributor>
<Contributor_id>10000002/FDA</Contributor_id>
<Preferred>Non Preferred</Preferred>
</Term_Contributor>
<Term_Contributor>
<Contributor_id>10000003/GRLPSC</Contributor_id>
<Preferred>Non Preferred</Preferred>
</Term_Contributor>
</Term_Contributors>
<Term_Sources>
<Term_Source>
<Source>
<Source_ID>2009007637/National Archives and Record Administration database (1987-)</Source_ID>
</Source>
<Page>
</Page>
<Preferred>Non Preferred</Preferred>
</Term_Source>
<Term_Source>
<Source>
<Source_ID>2009007642/Library of Congress HABS/HAER (1987-)</Source_ID>
</Source>
<Page>
</Page>
<Preferred>Non Preferred</Preferred>
</Term_Source>
<Term_Source>
<Source>
<Source_ID>2009008541/Encyclopaedia Britannica (1988)</Source_ID>
</Source>
<Page>10:54</Page>
<Preferred>Unknown</Preferred>
</Term_Source>
<Term_Source>
<Source>
<Source_ID>9006447/Canby, Historic Places (1984)</Source_ID>
</Source>
<Page>2:779</Page>
<Preferred>Unknown</Preferred>
</Term_Source>
<Term_Source>
<Source>
<Source_ID>9006449/Webster's Geographical Dictionary (1984)</Source_ID>
</Source>
<Page>
</Page>
<Preferred>Unknown</Preferred>
</Term_Source>
<Term_Source>
<Source>
<Source_ID>9006553/USGS, GNIS Digital Gazetteer (1994)</Source_ID>
</Source>
<Page>GNIS51022789</Page>
<Preferred>Unknown</Preferred>
</Term_Source>
</Term_Sources>
</Preferred_Term>
</Terms>
<Associative_Relationships>
<Associative_Relationship>
<Relationship_Type>capital of</Relationship_Type>
<Related_Subject_ID>
<VP_Subject_ID>7007919</VP_Subject_ID>
</Related_Subject_ID>
<Historic_Flag>Current</Historic_Flag>
</Associative_Relationship>
</Associative_Relationships>
<Subject_Contributors>
<Subject_Contributor>
<Contributor_id>10000000/VP</Contributor_id>
</Subject_Contributor>
<Subject_Contributor>
<Contributor_id>10000001/BHA</Contributor_id>
</Subject_Contributor>
<Subject_Contributor>
<Contributor_id>10000002/FDA</Contributor_id>
</Subject_Contributor>
<Subject_Contributor>
<Contributor_id>10000003/GRLPSC</Contributor_id>
</Subject_Contributor>
</Subject_Contributors>
<Subject_Sources>
<Subject_Source>
<Source>
<Source_ID>2009008541/Encyclopaedia Britannica (1988)</Source_ID>
</Source>
</Subject_Source>
<Subject_Source>
<Source>
<Source_ID>9006447/Canby, Historic Places (1984)</Source_ID>
</Source>
</Subject_Source>
<Subject_Source>
<Source>
<Source_ID>9006449/Webster's Geographical Dictionary (1984)</Source_ID>
</Source>
</Subject_Source>
<Subject_Source>
<Source>
<Source_ID>9006553/USGS, GNIS Digital Gazetteer (1994)</Source_ID>
</Source>
</Subject_Source>
<Subject_Source>
<Source>
<Source_ID>2009007637/National Archives and Record Administration database (1987-)</Source_ID>
</Source>
</Subject_Source>
<Subject_Source>
<Source>
<Source_ID>2009007642/Library of Congress HABS/HAER (1987-)</Source_ID>
</Source>
</Subject_Source>
</Subject_Sources>
<Place_Types>
<Preferred_Place_Type>
<Place_Type_ID>1</Place_Type_ID>
<Historic_Flag>Current</Historic_Flag>
<PT_Date>
<Display_Date>site explored in 1697 by settlers led by Christopher Newport & John Smith, shortly after the founding of Jamestown</Display_Date>
<Start_Date>1607</Start_Date>
<End_Date>9999</End_Date>
</PT_Date>
</Preferred_Place_Type>
<Non-Preferred_Place_Type>
<Place_Type_ID>1</Place_Type_ID>
<Historic_Flag>Current</Historic_Flag>
<PT_Date>
<Display_Date>since 1782</Display_Date>
<Start_Date>1782</Start_Date>
<End_Date>9999</End_Date>
</PT_Date>
</Non-Preferred_Place_Type>
<Non-Preferred_Place_Type>
<Place_Type_ID>1</Place_Type_ID>
<Historic_Flag>Current</Historic_Flag>
<PT_Date>
<Display_Date>since 1779</Display_Date>
<Start_Date>1779</Start_Date>
<End_Date>9999</End_Date>
</PT_Date>
</Non-Preferred_Place_Type>
<Non-Preferred_Place_Type>
<Place_Type_ID>1</Place_Type_ID>
<Historic_Flag>Current</Historic_Flag>
<PT_Date>
<Display_Date>of Henrico county</Display_Date>
<Start_Date>1800</Start_Date>
<End_Date>9999</End_Date>
</PT_Date>
</Non-Preferred_Place_Type>
<Non-Preferred_Place_Type>
<Place_Type_ID>1</Place_Type_ID>
<Historic_Flag>Current</Historic_Flag>
</Non-Preferred_Place_Type>
<Non-Preferred_Place_Type>
<Place_Type_ID>1</Place_Type_ID>
<Historic_Flag>Current</Historic_Flag>
<PT_Date>
<Display_Date>produces chemicals, textiles, pharmaceuticals, wood & paper products</Display_Date>
<Start_Date>1800</Start_Date>
<End_Date>9999</End_Date>
</PT_Date>
</Non-Preferred_Place_Type>
<Non-Preferred_Place_Type>
<Place_Type_ID>1</Place_Type_ID>
<Historic_Flag>Current</Historic_Flag>
<PT_Date>
<Display_Date>site of Confederate White House, Robert E Lee House & St. John's Church, scene of Patrick Henry's "Liberty or Death " speech</Display_Date>
<Start_Date>1900</Start_Date>
<End_Date>9999</End_Date>
</PT_Date>
</Non-Preferred_Place_Type>
<Non-Preferred_Place_Type>
<Place_Type_ID>1</Place_Type_ID>
<Historic_Flag>Current</Historic_Flag>
</Non-Preferred_Place_Type>
<Non-Preferred_Place_Type>
<Place_Type_ID>1</Place_Type_ID>
<Historic_Flag>Historical</Historic_Flag>
<PT_Date>
<Display_Date>of the Confederacy during the Civil War</Display_Date>
<Start_Date>1860</Start_Date>
<End_Date>1865</End_Date>
</PT_Date>
</Non-Preferred_Place_Type>
<Non-Preferred_Place_Type>
<Place_Type_ID>1</Place_Type_ID>
<Historic_Flag>Historical</Historic_Flag>
<PT_Date>
<Display_Date>city was burned on April 3, 1865 during Civil War</Display_Date>
<Start_Date>1865</Start_Date>
<End_Date>1865</End_Date>
</PT_Date>
</Non-Preferred_Place_Type>
</Place_Types>
<Coordinates>
<Standard>
<Latitude>
<Degrees>37</Degrees>
<Minutes>33</Minutes>
<Seconds>0</Seconds>
<Direction>North</Direction>
<Decimal>38</Decimal>
</Latitude>
<Longitude>
<Degrees>77</Degrees>
<Minutes>27</Minutes>
<Seconds>0</Seconds>
<Direction>West</Direction>
<Decimal>-77</Decimal>
</Longitude>
</Standard>
</Coordinates>
</Subject>
import re, os
source = "path_to_the_folder_with_TGN_xml_files"
def generateTGNdata(source):
lof = os.listdir(source)
tgnList = []
tgnListNA = []
count = 0
for f in lof:
if f.startswith("TGN"): # fileName test
print(f)
with open(source+f, "r", encoding="utf8") as f1:
data = f1.read()
data = re.split("</Subject>", data)
for d in data:
d = re.sub("\n +", "", d)
#print(d)
if "Subject_ID" in d:
# SUBJECT ID
placeID = re.search(r"Subject_ID=\"(\d+)\"", d).group(1)
#print(placeID)
# NAME OF THE PLACE
placeName = re.search(r"<Term_Text>([^<]+)</Term_Text>", d).group(1)
#print(placeName)
# COORDINATES
if "<Coordinates>" in d:
latGr = re.search(r"<Latitude>(.*)</Latitude>", d).group(1)
lat = re.search(r"<Decimal>(.*)</Decimal>", latGr).group(1)
lonGr = re.search(r"<Longitude>(.*)</Longitude>", d).group(1)
lon = re.search(r"<Decimal>(.*)</Decimal>", lonGr).group(1)
#print(lat)
#print(lon)
else:
lat = "NA"
lon = "NA"
tgnList.append("\t".join([placeID, placeName, lat, lon]))
#input(tgnList)
if lat == "NA":
print("\t"+ "; ".join([placeID, placeName, lat, lon]))
tgnListNA.append("\t".join([placeID, placeName, lat, lon]))
# saving
header = "tgnID\tplacename\tlat\tlon\n"
with open("tgn_data_light.tsv", "w", encoding="utf8") as f9a:
f9a.write(header+"\n".join(tgnList))
with open("tgn_data_light_NA.tsv", "w", encoding="utf8") as f9b:
f9b.write(header+"\n".join(tgnListNA))
print("TGN has %d items" % len(tgnList))
generateTGNdata(source)
#TGN has 2,487,572 items
# 17,613 items do not have coordinates.
Matching collected geographical data with Getty Gazetteer
First we want to load TGN fdata into a dictionary:
import re, os
def loadTGN(tgnTSV):
with open(tgnTSV, "r", encoding="utf8") as f1:
data = f1.read().split("\n")
dic = {}
for d in data:
d = d.split("\t")
dic[d[0]] = d
return(dic)
Then we can start matching:
def match(freqFile, dicToMatch):
with open(freqFile, "r", encoding="utf8") as f1:
data = f1.read().split("\n")
dataNew = []
dataNewNA = []
count = 0
for d in data[1:]:
tgnID = d.split("\t")[1]
freq = d.split("\t")[0]
if tgnID in dicToMatch:
val = "\t".join(dicToMatch[tgnID])
val = val + "\t" + freq
if "\tNA\t" in val:
dataNewNA.append(val)
else:
dataNew.append(val)
else:
print("%s (%d) not in TGN!" % (tgnID, int(freq)))
count += 1
header = "tgnID\tplacename\tlat\tlon\tfreq\n"
with open("coord_"+freqFile, "w", encoding="utf8") as f9a:
f9a.write(header + "\n".join(dataNew))
with open("coord_NA_"+freqFile, "w", encoding="utf8") as f9b:
f9b.write(header + "\n".join(dataNewNA))
print("%d item have not been matched..." % count)
Running these two functions will look like:
dictionary = loadTGN("tgn_data_light.tsv")
match("dispatch_toponyms_1861.tsv", dictionary)
match("dispatch_toponyms_1862.tsv", dictionary)
match("dispatch_toponyms_1863.tsv", dictionary)
match("dispatch_toponyms_1864.tsv", dictionary)
match("dispatch_toponyms_1865.tsv", dictionary)
Generated files now can be loaded into QGIS as text-delimited layers, although there is a bit of a problem with TGN, which will become apparent soon.
Maps in QGIS
NB: In general, the following brief instructions should suffice; if you are confused, google it, or/and ask your comrades for help. In class, you can use this file.
- In QGIS:
- Layer >
- Add Layer >
- Add Delimited Text Layer (need to point which columns are coordinates!)
- If your data appears in a small area of the map, try to change projection
- Style your layer (size of circles, transparency, labels, etc.)
- Sizing dots:
- Right-click on your layer > Properties > Symbology >
- From the drop down menu on top, select Graduated (Default: Single symbol)
- In Column, select your column with frequencies
- Method > choose, Size and adjust the size of the smallest and the largest dot (In Layer rendering you can adjust transparency of your dots).
- In Classes, make sure to select Equal Interval (although feel free to play with other options as well) > Click Classify (you can change the number of classes and apply Symmetric Classification) > click Apply
- Experiment with other options. If you get stuck, simply start over.
- From the drop down menu on top, select Graduated (Default: Single symbol)
- Right-click on your layer > Properties > Symbology >
- In QGIS:
- Use
QuickMapServices (QMS)
plugin to add a map layer quickly.
- Use
- In QGIS:
- You can use
TimeManager
plugin to animate your maps over time.
- You can use
Results
Reference Materials:
- Frequency list: Turkel, William J., and Adam Crymble. 2012. “Counting Word Frequencies with Python.” Programming Historian, July. https://programminghistorian.org/lessons/counting-frequencies.
- Creating cartograms with R: https://maximromanov.github.io/2015/04-02.html
Homework (1/2 and 2/2):
- GISting the “Dispatch” II: Mapping geographical data from the “Dispatch”
- Extract toponyms (place names) from the “Dispatch” (python)
- Calculate their frequencies (python)
- Generate files for all years, all months, and all days (you can save results in subfolders, but, better, in single files with dates in another column)
- Add coordinates to those places (QGIS/python)
- Hint: you can create two lists 1) one with place names and their frequencies, 2) with place names and their coordinates; after that, you can merge them with python and generate a CSV file (placename, timeParameter, latitude, longitude, frequency) which can be used for creating a map in QGIS
- Map them in QGIS, using frequencies to size the markers on the map
- Describe the process in a blogpost and publish on your website (including screenshots).
- since you are provided with almost complete solutions here, you should do the following:
- describe every line of code (or a block of code)
- provide alternative code (to the whole or parts), explain why you think yours is better
- try to find errors, inefficiencies, or/and flaws in the offered solutions. Explain—this will give you extra points.
- since you are provided with almost complete solutions here, you should do the following:
Submitting homework:
- Homework assignment must be submitted by the beginning of the next class;
- Email your homework to the instructor.
- if your homework is to create a file, email it as an attachment
- if your homework is a blogpost on your website, email the link to your website and to the blogpost with your homework.
- In the subject of your email, please, add the following:
070112-LXX-HW-YourLastName-YourMatriculationNumber
, whereLXX
is the lesson for which the homework is submitted,YourLastName
is your last name, andYourMatriculationNumber
is your matriculation number.