
Tools & Techniques for Digital Humanities (2019W)


070112 UE Course in Methodology - Tools & Techniques for Digital Humanities (2019W) — University of Vienna, Department of History; Instructor Dr. Maxim G. Romanov


L11 Text to Map (2/2) — Merging & Mapping Data

Goals:

  1. To prepare data for mapping (final step)
  2. Do simple mapping in QGIS

Solutions to Text to Map (1/2)

Collecting all toponyms: using TGN numbers

import re, os

source = "path_where_initial_xml_files_are"
target = "path_to_save_new_files"


lof = os.listdir(source)
counter = 0 # general counter to keep track of the progress

def generate(filter):

    topCountDictionary = {}

    print(filter)
    counter = 0
    for f in lof:
        if f.startswith("dltext"): # fileName test        
            with open(os.path.join(source, f), "r", encoding="utf8") as f1: # join works with or without a trailing slash in `source`
                text = f1.read()

                text = text.replace("&amp;", "&")

                # try to find the date (an empty string if the file has none)
                dateMatch = re.search(r'<date value="([\d-]+)"', text)
                date = dateMatch.group(1) if dateMatch else ""
                #print(date)

                if date.startswith(filter):
                    for tg in re.findall(r"(tgn,\d+)", text):
                        tgn = tg.split(",")[1]

                        if tgn in topCountDictionary:
                            topCountDictionary[tgn] += 1
                        else:
                            topCountDictionary[tgn]  = 1

                        #input(topCountDictionary)
                    
    top_TSV = []

    for k,v in topCountDictionary.items():
        val = "%09d\t%s" % (int(v/2), k)
        # `tgn,XXXXX` occurs twice for every tagged toponym
        # NB: unfortunately, there are some toponyms/placenames that are tagged,
        #     but do not have tgn identifiers --- these will be missed;
        #     # how can these be accounted for? 

        top_TSV.append(val)                    

    # saving
    header = "freq\ttgn\n"
    with open("dispatch_toponyms_%s.tsv" % filter, "w", encoding="utf8") as f9:
        f9.write(header+"\n".join(top_TSV))
    #print(counter)

#generate("186") # - this will aggregate all data

generate("1861")
generate("1862")
generate("1863")
generate("1864")
generate("1865")

Question to ponder: How can this data be improved? (Not the quality of the initial data, but the structure of what you reformat.)
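One possible answer, offered as a sketch rather than the intended solution: keep everything in a single table and store the date in its own column, so that years, months, or days do not need separate files. The sample strings and the attribute names `key` and `reg` are invented stand-ins for the real "Dispatch" files; only the `<date value="...">` and `tgn,NNN` patterns are taken from the script above.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the contents of two "dltext" files.
SAMPLE_TEXTS = [
    '<date value="1861-03-05"/>'
    '<placeName key="tgn,7013964" reg="tgn,7013964">Richmond</placeName>'
    '<placeName key="tgn,7007919" reg="tgn,7007919">Virginia</placeName>',
    '<date value="1862-07-01"/>'
    '<placeName key="tgn,7013964" reg="tgn,7013964">Richmond</placeName>',
]


def count_toponyms(texts, date_len=4):
    """Count TGN ids per date prefix (4 = year, 7 = month, 10 = day)."""
    counts = Counter()
    for text in texts:
        date_match = re.search(r'<date value="([\d-]+)"', text)
        if date_match is None:
            continue  # no date found: skip the text instead of crashing
        period = date_match.group(1)[:date_len]
        for tgn in re.findall(r"tgn,(\d+)", text):
            counts[(period, tgn)] += 1
    # every tagged toponym carries its id twice, hence the halving
    return {key: n // 2 for key, n in counts.items()}


def to_tsv(counts):
    """Serialize the counts as one table: date, tgn, freq."""
    rows = ["date\ttgn\tfreq"]
    for (period, tgn), freq in sorted(counts.items()):
        rows.append("%s\t%s\t%d" % (period, tgn, freq))
    return "\n".join(rows)


if __name__ == "__main__":
    print(to_tsv(count_toponyms(SAMPLE_TEXTS)))
```

With the date in a column, a single file can later be filtered by period in QGIS or in Python, and changing `date_len` switches between yearly, monthly, and daily counts.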

Reformatting the Getty Thesaurus of Geographic Names (from XML)

NB: All files can be downloaded from here: http://tgndownloads.getty.edu/default.aspx.

Example of an XML record:


<Subject Subject_ID="7013964">
    <Parent_Relationships>
      <Preferred_Parent>
        <Parent_Subject_ID>7022211</Parent_Subject_ID>
        <Relationship_Type>Parent/Child</Relationship_Type>
        <Historic_Flag>Current</Historic_Flag>
        <Parent_String>Richmond Indep. City (independent city) [7022211], Virginia (state) [7007919], United States (nation) [7012149], North and Central America (continent) [1000001], World (facet) [7029392]</Parent_String>
        <Hier_Rel_Type>Whole/Part-BTP</Hier_Rel_Type>
      </Preferred_Parent>
    </Parent_Relationships>
    <Descriptive_Notes>
      <Descriptive_Note>
        <Note_Text>At head of James River; long impotant as tobacco center; site of trading post 1637 &amp; Fort Charles 1644; pillaged by British during Revolutionary War &amp; by Union forces during Civil War; State Capitol building (begun in 1795) designed by Thomas Jefferson.</Note_Text>
        <Note_Language>English</Note_Language>
        <Note_Contributors>
          <Note_Contributor>
            <Contributor_id>10000000/VP</Contributor_id>
          </Note_Contributor>
        </Note_Contributors>
        <Note_Sources>
          <Note_Source>
            <Source>
              <Source_ID>9006382/Encyclopaedia Britannica (1985)</Source_ID>
            </Source>
          </Note_Source>
        </Note_Sources>
      </Descriptive_Note>
    </Descriptive_Notes>
    <Record_Type>Administrative</Record_Type>
    <Merged_Status>Merged</Merged_Status>
    <Terms>
      <Preferred_Term>
        <Term_Text>Richmond</Term_Text>
        <Display_Name>N/A</Display_Name>
        <Historic_Flag>Current</Historic_Flag>
        <Vernacular>Vernacular</Vernacular>
        <Other_Flags>N/A</Other_Flags>
        <Term_ID>298597</Term_ID>
        <Term_Languages>
          <Term_Language>
            <Language>70001/undetermined</Language>
            <Preferred>Non Preferred</Preferred>
            <Qualifier>
            </Qualifier>
            <Term_Type>N/A</Term_Type>
            <Part_of_Speech>Noun</Part_of_Speech>
            <Lang_Stat>Undetermined</Lang_Stat>
          </Term_Language>
        </Term_Languages>
        <Term_Contributors>
          <Term_Contributor>
            <Contributor_id>10000000/VP</Contributor_id>
            <Preferred>Preferred</Preferred>
          </Term_Contributor>
          <Term_Contributor>
            <Contributor_id>10000001/BHA</Contributor_id>
            <Preferred>Non Preferred</Preferred>
          </Term_Contributor>
          <Term_Contributor>
            <Contributor_id>10000002/FDA</Contributor_id>
            <Preferred>Non Preferred</Preferred>
          </Term_Contributor>
          <Term_Contributor>
            <Contributor_id>10000003/GRLPSC</Contributor_id>
            <Preferred>Non Preferred</Preferred>
          </Term_Contributor>
        </Term_Contributors>
        <Term_Sources>
          <Term_Source>
            <Source>
              <Source_ID>2009007637/National Archives and Record Administration database (1987-)</Source_ID>
            </Source>
            <Page>
            </Page>
            <Preferred>Non Preferred</Preferred>
          </Term_Source>
          <Term_Source>
            <Source>
              <Source_ID>2009007642/Library of Congress HABS/HAER (1987-)</Source_ID>
            </Source>
            <Page>
            </Page>
            <Preferred>Non Preferred</Preferred>
          </Term_Source>
          <Term_Source>
            <Source>
              <Source_ID>2009008541/Encyclopaedia Britannica (1988)</Source_ID>
            </Source>
            <Page>10:54</Page>
            <Preferred>Unknown</Preferred>
          </Term_Source>
          <Term_Source>
            <Source>
              <Source_ID>9006447/Canby, Historic Places (1984)</Source_ID>
            </Source>
            <Page>2:779</Page>
            <Preferred>Unknown</Preferred>
          </Term_Source>
          <Term_Source>
            <Source>
              <Source_ID>9006449/Webster's Geographical Dictionary (1984)</Source_ID>
            </Source>
            <Page>
            </Page>
            <Preferred>Unknown</Preferred>
          </Term_Source>
          <Term_Source>
            <Source>
              <Source_ID>9006553/USGS, GNIS Digital Gazetteer (1994)</Source_ID>
            </Source>
            <Page>GNIS51022789</Page>
            <Preferred>Unknown</Preferred>
          </Term_Source>
        </Term_Sources>
      </Preferred_Term>
    </Terms>
    <Associative_Relationships>
      <Associative_Relationship>
        <Relationship_Type>capital of</Relationship_Type>
        <Related_Subject_ID>
          <VP_Subject_ID>7007919</VP_Subject_ID>
        </Related_Subject_ID>
        <Historic_Flag>Current</Historic_Flag>
      </Associative_Relationship>
    </Associative_Relationships>
    <Subject_Contributors>
      <Subject_Contributor>
        <Contributor_id>10000000/VP</Contributor_id>
      </Subject_Contributor>
      <Subject_Contributor>
        <Contributor_id>10000001/BHA</Contributor_id>
      </Subject_Contributor>
      <Subject_Contributor>
        <Contributor_id>10000002/FDA</Contributor_id>
      </Subject_Contributor>
      <Subject_Contributor>
        <Contributor_id>10000003/GRLPSC</Contributor_id>
      </Subject_Contributor>
    </Subject_Contributors>
    <Subject_Sources>
      <Subject_Source>
        <Source>
          <Source_ID>2009008541/Encyclopaedia Britannica (1988)</Source_ID>
        </Source>
      </Subject_Source>
      <Subject_Source>
        <Source>
          <Source_ID>9006447/Canby, Historic Places (1984)</Source_ID>
        </Source>
      </Subject_Source>
      <Subject_Source>
        <Source>
          <Source_ID>9006449/Webster's Geographical Dictionary (1984)</Source_ID>
        </Source>
      </Subject_Source>
      <Subject_Source>
        <Source>
          <Source_ID>9006553/USGS, GNIS Digital Gazetteer (1994)</Source_ID>
        </Source>
      </Subject_Source>
      <Subject_Source>
        <Source>
          <Source_ID>2009007637/National Archives and Record Administration database (1987-)</Source_ID>
        </Source>
      </Subject_Source>
      <Subject_Source>
        <Source>
          <Source_ID>2009007642/Library of Congress HABS/HAER (1987-)</Source_ID>
        </Source>
      </Subject_Source>
    </Subject_Sources>
    <Place_Types>
      <Preferred_Place_Type>
        <Place_Type_ID>1</Place_Type_ID>
        <Historic_Flag>Current</Historic_Flag>
        <PT_Date>
          <Display_Date>site explored in 1697 by settlers led by Christopher Newport &amp; John Smith, shortly after the founding of Jamestown</Display_Date>
          <Start_Date>1607</Start_Date>
          <End_Date>9999</End_Date>
        </PT_Date>
      </Preferred_Place_Type>
      <Non-Preferred_Place_Type>
        <Place_Type_ID>1</Place_Type_ID>
        <Historic_Flag>Current</Historic_Flag>
        <PT_Date>
          <Display_Date>since 1782</Display_Date>
          <Start_Date>1782</Start_Date>
          <End_Date>9999</End_Date>
        </PT_Date>
      </Non-Preferred_Place_Type>
      <Non-Preferred_Place_Type>
        <Place_Type_ID>1</Place_Type_ID>
        <Historic_Flag>Current</Historic_Flag>
        <PT_Date>
          <Display_Date>since 1779</Display_Date>
          <Start_Date>1779</Start_Date>
          <End_Date>9999</End_Date>
        </PT_Date>
      </Non-Preferred_Place_Type>
      <Non-Preferred_Place_Type>
        <Place_Type_ID>1</Place_Type_ID>
        <Historic_Flag>Current</Historic_Flag>
        <PT_Date>
          <Display_Date>of Henrico county</Display_Date>
          <Start_Date>1800</Start_Date>
          <End_Date>9999</End_Date>
        </PT_Date>
      </Non-Preferred_Place_Type>
      <Non-Preferred_Place_Type>
        <Place_Type_ID>1</Place_Type_ID>
        <Historic_Flag>Current</Historic_Flag>
      </Non-Preferred_Place_Type>
      <Non-Preferred_Place_Type>
        <Place_Type_ID>1</Place_Type_ID>
        <Historic_Flag>Current</Historic_Flag>
        <PT_Date>
          <Display_Date>produces chemicals, textiles, pharmaceuticals, wood &amp; paper products</Display_Date>
          <Start_Date>1800</Start_Date>
          <End_Date>9999</End_Date>
        </PT_Date>
      </Non-Preferred_Place_Type>
      <Non-Preferred_Place_Type>
        <Place_Type_ID>1</Place_Type_ID>
        <Historic_Flag>Current</Historic_Flag>
        <PT_Date>
          <Display_Date>site of Confederate White House, Robert E Lee House &amp; St. John's Church, scene of Patrick Henry's "Liberty or Death " speech</Display_Date>
          <Start_Date>1900</Start_Date>
          <End_Date>9999</End_Date>
        </PT_Date>
      </Non-Preferred_Place_Type>
      <Non-Preferred_Place_Type>
        <Place_Type_ID>1</Place_Type_ID>
        <Historic_Flag>Current</Historic_Flag>
      </Non-Preferred_Place_Type>
      <Non-Preferred_Place_Type>
        <Place_Type_ID>1</Place_Type_ID>
        <Historic_Flag>Historical</Historic_Flag>
        <PT_Date>
          <Display_Date>of the Confederacy during the Civil War</Display_Date>
          <Start_Date>1860</Start_Date>
          <End_Date>1865</End_Date>
        </PT_Date>
      </Non-Preferred_Place_Type>
      <Non-Preferred_Place_Type>
        <Place_Type_ID>1</Place_Type_ID>
        <Historic_Flag>Historical</Historic_Flag>
        <PT_Date>
          <Display_Date>city was burned on April 3, 1865 during Civil War</Display_Date>
          <Start_Date>1865</Start_Date>
          <End_Date>1865</End_Date>
        </PT_Date>
      </Non-Preferred_Place_Type>
    </Place_Types>
    <Coordinates>
      <Standard>
        <Latitude>
          <Degrees>37</Degrees>
          <Minutes>33</Minutes>
          <Seconds>0</Seconds>
          <Direction>North</Direction>
          <Decimal>38</Decimal>
        </Latitude>
        <Longitude>
          <Degrees>77</Degrees>
          <Minutes>27</Minutes>
          <Seconds>0</Seconds>
          <Direction>West</Direction>
          <Decimal>-77</Decimal>
        </Longitude>
      </Standard>
    </Coordinates>
  </Subject>

import re, os

source = "path_to_the_folder_with_TGN_xml_files"

def generateTGNdata(source):

    lof = os.listdir(source)

    tgnList = []
    tgnListNA = []
    count = 0

    for f in lof:
        if f.startswith("TGN"): # fileName test
            print(f)
            with open(os.path.join(source, f), "r", encoding="utf8") as f1: # join works with or without a trailing slash in `source`
                data = f1.read()

                data = re.split("</Subject>", data)

                for d in data:
                    d = re.sub("\n +", "", d)
                    #print(d)

                    if "Subject_ID" in d:
                        # SUBJECT ID
                        placeID = re.search(r"Subject_ID=\"(\d+)\"", d).group(1)
                        #print(placeID)

                        # NAME OF THE PLACE
                        placeName = re.search(r"<Term_Text>([^<]+)</Term_Text>", d).group(1)
                        #print(placeName)

                        # COORDINATES
                        if "<Coordinates>" in d:
                            latGr = re.search(r"<Latitude>(.*)</Latitude>", d).group(1)
                            lat = re.search(r"<Decimal>(.*)</Decimal>", latGr).group(1)

                            lonGr = re.search(r"<Longitude>(.*)</Longitude>", d).group(1)
                            lon = re.search(r"<Decimal>(.*)</Decimal>", lonGr).group(1)
                            #print(lat)
                            #print(lon)
                        else:
                            lat = "NA"
                            lon = "NA"

                        tgnList.append("\t".join([placeID, placeName, lat, lon]))
                        #input(tgnList)

                        if lat == "NA":
                            print("\t"+ "; ".join([placeID, placeName, lat, lon]))
                            tgnListNA.append("\t".join([placeID, placeName, lat, lon]))

    # saving
    header = "tgnID\tplacename\tlat\tlon\n"

    with open("tgn_data_light.tsv", "w", encoding="utf8") as f9a:
         f9a.write(header+"\n".join(tgnList))

    with open("tgn_data_light_NA.tsv", "w", encoding="utf8") as f9b:
         f9b.write(header+"\n".join(tgnListNA))

    print("TGN has %d items" % len(tgnList))

generateTGNdata(source)

#TGN has 2,487,572 items
#    17,613 items do not have coordinates.
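As an alternative to regular expressions, a record can also be read with an XML parser. The sketch below uses the standard `xml.etree.ElementTree` module on an abridged copy of the Richmond record. Note that in that record the `Decimal` values (38 and -77) are coarser than the degrees and minutes (37°33' N, 77°27' W), so the sketch computes decimal degrees from the latter. The function names are hypothetical, and the sketch assumes that `Direction` is always one of North, South, East, or West, and that `Degrees` is unsigned, as in the sample.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the record shown above (only the fields used here).
RECORD = """<Subject Subject_ID="7013964">
  <Terms>
    <Preferred_Term>
      <Term_Text>Richmond</Term_Text>
    </Preferred_Term>
  </Terms>
  <Coordinates>
    <Standard>
      <Latitude>
        <Degrees>37</Degrees><Minutes>33</Minutes><Seconds>0</Seconds>
        <Direction>North</Direction><Decimal>38</Decimal>
      </Latitude>
      <Longitude>
        <Degrees>77</Degrees><Minutes>27</Minutes><Seconds>0</Seconds>
        <Direction>West</Direction><Decimal>-77</Decimal>
      </Longitude>
    </Standard>
  </Coordinates>
</Subject>"""


def dms_to_decimal(node):
    """Convert a <Latitude> or <Longitude> element to signed decimal degrees."""
    degrees = float(node.findtext("Degrees"))
    minutes = float(node.findtext("Minutes"))
    seconds = float(node.findtext("Seconds"))
    value = degrees + minutes / 60 + seconds / 3600
    # assumption: South and West are the negative directions
    if node.findtext("Direction") in ("South", "West"):
        value = -value
    return round(value, 4)


def parse_subject(xml_string):
    """Return [id, name, lat, lon] for one <Subject> record ("NA" if no coordinates)."""
    subject = ET.fromstring(xml_string)
    place_id = subject.get("Subject_ID")
    name = subject.findtext("Terms/Preferred_Term/Term_Text")
    lat_node = subject.find("Coordinates/Standard/Latitude")
    lon_node = subject.find("Coordinates/Standard/Longitude")
    if lat_node is None or lon_node is None:
        return [place_id, name, "NA", "NA"]
    return [place_id, name, str(dms_to_decimal(lat_node)), str(dms_to_decimal(lon_node))]


if __name__ == "__main__":
    print("\t".join(parse_subject(RECORD)))
```

For the full TGN files, which are large, `ET.iterparse` may be a better fit than reading each file into memory at once.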

Matching collected geographical data with Getty Gazetteer

First, we want to load the TGN data into a dictionary:

import re, os

def loadTGN(tgnTSV):
    with open(tgnTSV, "r", encoding="utf8") as f1:
        data = f1.read().split("\n")

        dic = {}

        for d in data:
            d = d.split("\t")

            dic[d[0]] = d

    return(dic)

Then we can start matching:


def match(freqFile, dicToMatch):
    with open(freqFile, "r", encoding="utf8") as f1:
        data = f1.read().split("\n")

        dataNew = []
        dataNewNA = []
        count = 0

        for d in data[1:]:
            if not d.strip():
                continue # skip blank lines (e.g., a file that has only a header)
            freq, tgnID = d.split("\t")[:2]

            if tgnID in dicToMatch:
                val = "\t".join(dicToMatch[tgnID])
                val  = val + "\t" + freq

                if "\tNA\t" in val:
                    dataNewNA.append(val)
                else:
                    dataNew.append(val)
            else:
                print("%s (%d) not in TGN!" % (tgnID, int(freq)))
                count += 1

    header = "tgnID\tplacename\tlat\tlon\tfreq\n"

    with open("coord_"+freqFile, "w", encoding="utf8") as f9a:
        f9a.write(header + "\n".join(dataNew))

    with open("coord_NA_"+freqFile, "w", encoding="utf8") as f9b:
        f9b.write(header + "\n".join(dataNewNA))

    print("%d items have not been matched..." % count)

Running these two functions will look like:

dictionary = loadTGN("tgn_data_light.tsv")

match("dispatch_toponyms_1861.tsv", dictionary)
match("dispatch_toponyms_1862.tsv", dictionary)
match("dispatch_toponyms_1863.tsv", dictionary)
match("dispatch_toponyms_1864.tsv", dictionary)
match("dispatch_toponyms_1865.tsv", dictionary)

The generated files can now be loaded into QGIS as delimited text layers, although there is a bit of a problem with TGN, which will become apparent soon.
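Before loading a file into QGIS, it may help to check that every row has usable coordinates. The following is a minimal sketch, not part of the solution itself: the function name and all sample rows are invented for illustration, and only the column names follow the header written by `match`.

```python
import csv
import io

# Hypothetical sample in the layout written by match(); the values are invented.
SAMPLE = (
    "tgnID\tplacename\tlat\tlon\tfreq\n"
    "7013964\tRichmond\t37.55\t-77.45\t000000120\n"
    "1000001\tNorth and Central America\tNA\tNA\t000000003\n"
    "9999999\tNowhere\t137.0\t-77.0\t000000001"
)


def check_coordinates(tsv_text):
    """Split rows into mappable ones and (tgnID, reason) problem reports."""
    good, problems = [], []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        try:
            lat, lon = float(row["lat"]), float(row["lon"])
        except (TypeError, ValueError):
            problems.append((row["tgnID"], "non-numeric coordinates"))
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append((row["tgnID"], "coordinates out of range"))
            continue
        row["freq"] = int(row["freq"])  # drops the zero padding
        good.append(row)
    return good, problems


if __name__ == "__main__":
    good, problems = check_coordinates(SAMPLE)
    print(len(good), "rows can be mapped")
    for tgn_id, reason in problems:
        print(tgn_id, reason)
```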

Maps in QGIS

NB: In general, the following brief instructions should suffice; if you are confused, google it and/or ask your comrades for help. In class, you can use this file.

  1. In QGIS:
    • Layer >
    • Add Layer >
    • Add Delimited Text Layer (you need to specify which columns contain the coordinates!)
    • If your data appears in a small area of the map, try to change projection
    • Style your layer (size of circles, transparency, labels, etc.)
  2. Sizing dots:
    • Right-click on your layer > Properties > Symbology >
      • From the drop down menu on top, select Graduated (Default: Single symbol)
        • In Column, select your column with frequencies
        • Method > choose Size, and adjust the sizes of the smallest and the largest dots (in Layer rendering you can adjust the transparency of your dots).
        • In Classes, make sure to select Equal Interval (although feel free to play with other options as well) > Click Classify (you can change the number of classes and apply Symmetric Classification) > click Apply
        • Experiment with other options. If you get stuck, simply start over.
  3. In QGIS:
    • Use QuickMapServices (QMS) plugin to add a map layer quickly.
  4. In QGIS:
    • You can use TimeManager plugin to animate your maps over time.
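
To see what Equal Interval does with skewed frequencies, the classification can be imitated in a few lines of Python. This is only an approximation of the idea, not QGIS's own implementation; the function names, the sample frequencies, and the size range of 1 to 10 are assumptions made for illustration.

```python
def equal_interval_breaks(values, n_classes=5):
    """Upper bounds of n_classes equally wide classes between min and max."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_classes
    return [lowest + width * i for i in range(1, n_classes + 1)]


def class_index(value, breaks):
    """Index of the first class whose upper bound is not below the value."""
    for i, upper in enumerate(breaks):
        if value <= upper:
            return i
    return len(breaks) - 1


def dot_size(value, breaks, smallest=1.0, largest=10.0):
    """Interpolate the symbol size linearly across the classes."""
    if len(breaks) == 1:
        return smallest
    step = (largest - smallest) / (len(breaks) - 1)
    return smallest + step * class_index(value, breaks)


if __name__ == "__main__":
    freqs = [1, 3, 8, 40, 101]  # hypothetical toponym frequencies
    breaks = equal_interval_breaks(freqs)
    for f in freqs:
        print(f, class_index(f, breaks), dot_size(f, breaks))
```

With frequencies like these, three of the five places land in the smallest class, which is one reason to experiment with the other classification modes.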

Results

1862

1863

1864

Reference Materials:

Homework (1/2 and 2/2):

  1. GISting the “Dispatch” II: Mapping geographical data from the “Dispatch”
    • Extract toponyms (place names) from the “Dispatch” (python)
    • Calculate their frequencies (python)
    • Generate files for all years, all months, and all days (you can save the results in subfolders or, better, in single files with the dates in a separate column)
    • Add coordinates to those places (QGIS/python)
      • Hint: you can create two lists 1) one with place names and their frequencies, 2) with place names and their coordinates; after that, you can merge them with python and generate a CSV file (placename, timeParameter, latitude, longitude, frequency) which can be used for creating a map in QGIS
    • Map them in QGIS, using frequencies to size the markers on the map
  2. Describe the process in a blogpost and publish on your website (including screenshots).
    • since you are provided with almost complete solutions here, you should do the following:
      • describe every line of code (or a block of code)
      • provide alternative code (to the whole or parts), explain why you think yours is better
      • try to find errors, inefficiencies, and/or flaws in the offered solutions, and explain them; this will give you extra points.
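
The merging step from the hint can be sketched as follows. This is one possible shape for it, not the expected answer: the two input structures, their sample values, and the function names are hypothetical, and only the output columns follow the hint.

```python
import csv
import io

# Hypothetical toy inputs: (timeParameter, tgnID, frequency) ...
frequencies = [
    ("1861", "7013964", 120),
    ("1862", "7013964", 95),
    ("1862", "0000000", 2),
]
# ... and tgnID -> (placename, latitude, longitude)
coordinates = {"7013964": ("Richmond", "37.55", "-77.45")}


def merge(frequencies, coordinates):
    """Join the frequency list with the coordinate lookup on the TGN id."""
    matched, unmatched = [], []
    for period, tgn, freq in frequencies:
        if tgn in coordinates:
            name, lat, lon = coordinates[tgn]
            matched.append([name, period, lat, lon, freq])
        else:
            unmatched.append(tgn)  # keep track of ids without coordinates
    return matched, unmatched


def to_csv(rows):
    """Serialize the merged rows with the column order given in the hint."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["placename", "timeParameter", "latitude", "longitude", "frequency"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    matched, unmatched = merge(frequencies, coordinates)
    print(to_csv(matched))
    print("unmatched ids:", unmatched)
```

Using the `csv` module rather than joining strings by hand means that place names containing commas are quoted automatically.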

Submitting homework:

  • Homework assignment must be submitted by the beginning of the next class;
  • Email your homework to the instructor.
    • if your homework is to create a file, email it as an attachment
    • if your homework is a blogpost on your website, email the link to your website and to the blogpost with your homework.
    • In the subject of your email, please, add the following: 070112-LXX-HW-YourLastName-YourMatriculationNumber, where LXX is the lesson for which the homework is submitted, YourLastName is your last name, and YourMatriculationNumber is your matriculation number.