2 Dec 2019 • on data formats, xml, regular expressions, regex, python, batch conversion, text extraction, data extraction

L08 Structuring Data — Major Formats

Goals:

To learn about the major data formats and practice writing scripts that generate (or reformat) data into a specific format.

Software & Technologies:

python (simple scripting, regular expressions, batch processing)
basic data formats: csv/tsv (comma/tab-separated values), json, yml, etc.

Class:

explanation of major formats and their importance
hands-on: converting data into structured formats

The Essence

The ease of editing, suitability for analytical software, human-friendliness and readability, open vs. proprietary.

XML (Extensible Markup Language)

<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

CSV/TSV (Comma-Separated Values/ Tab-Separated Values)

to,from,heading,body
Tove,Jani,Reminder,Don't forget me this weekend!

JSON (JavaScript Object Notation)

{
  "to": "Tove",
  "from": "Jani",
  "heading": "Reminder",
  "body": "Don't forget me this weekend!"
}

YML or YAML (Yet Another Markup Language > YAML Ain’t Markup Language)

to: Tove
from: Jani
heading: Reminder
body: Don't forget me this weekend

Larger Examples

NB data example from here.

There are some online converters that can help you to convert one format into another. For example: http://www.convertcsv.com/.

`CSV` / `TSV`

city,growth_from_2000_to_2013,latitude,longitude,population,rank,state
New York,4.8%,40.7127837,-74.0059413,8405837,1,New York
Los Angeles,4.8%,34.0522342,-118.2436849,3884307,2,California
Chicago,-6.1%,41.8781136,-87.6297982,2718782,3,Illinois
Houston,11.0%,29.7604267,-95.3698028,2195914,4,Texas
Philadelphia,2.6%,39.9525839,-75.1652215,1553165,5,Pennsylvania

TSV is a better option than a CSV, since TAB characters are very unlikely to appear in values.

Neither TSV not CSV are good for preserving new line characters (\n)—or, in other words, text split into multiple lines. As a workaround, one can convert \n into some unlikely-to-occur character combination (for example, ;;;), which would allow to restore \n later , if necessary.

`JSON`

[
    {
        "city": "New York", 
        "growth_from_2000_to_2013": "4.8%", 
        "latitude": 40.7127837, 
        "longitude": -74.0059413, 
        "population": "8405837", 
        "rank": "1", 
        "state": "New York"
    }, 
    {
        "city": "Los Angeles", 
        "growth_from_2000_to_2013": "4.8%", 
        "latitude": 34.0522342, 
        "longitude": -118.2436849, 
        "population": "3884307", 
        "rank": "2", 
        "state": "California"
    }, 
    {
        "city": "Chicago", 
        "growth_from_2000_to_2013": "-6.1%", 
        "latitude": 41.8781136, 
        "longitude": -87.6297982, 
        "population": "2718782", 
        "rank": "3", 
        "state": "Illinois"
    }, 
    {
        "city": "Houston", 
        "growth_from_2000_to_2013": "11.0%", 
        "latitude": 29.7604267, 
        "longitude": -95.3698028, 
        "population": "2195914", 
        "rank": "4", 
        "state": "Texas"
    }, 
    {
        "city": "Philadelphia", 
        "growth_from_2000_to_2013": "2.6%", 
        "latitude": 39.9525839, 
        "longitude": -75.1652215, 
        "population": "1553165", 
        "rank": "5", 
        "state": "Pennsylvania"
    }
]

`YML`/`YAML`

YAML is often used only for a single set of parameters.

city: New York 
growth_from_2000_to_2013: 4.8% 
latitude: 40.7127837 
longitude: -74.0059413
population: 8405837 
rank: 1 
state: New York

But it can also be used for storage of serialized data. It has advantages of both JSON and CSV: the overall simplicity of the format (no tricky syntax) is similar to that of CSV/TSV, but it is more readable than CSV/TSV in any text editor, and is more difficult to break—again, due to the simplicity of the format.

New York:
  growth_from_2000_to_2013: 4.8% 
  latitude: 40.7127837 
  longitude: -74.0059413 
  population: 8405837 
  rank: 1 
  state: New York 
Los Angeles:
  growth_from_2000_to_2013: 4.8% 
  latitude: 34.0522342 
  longitude: -118.2436849 
  population: 3884307 
  rank: 2 
  state: California
Chicago:
  growth_from_2000_to_2013: -6.1% 
  latitude: 41.8781136 
  longitude: -87.6297982 
  population: 2718782 
  rank: 3 
  state: Illinois
Houston:
  growth_from_2000_to_2013: 11.0% 
  latitude: 29.7604267 
  longitude: -95.3698028 
  population: 2195914 
  rank: 4 
  state: Texas
Philadelphia:
  growth_from_2000_to_2013: 2.6% 
  latitude: 39.9525839 
  longitude: -75.1652215 
  population: 1553165 
  rank: 5 
  state: Pennsylvania

YAML files can be read with Python into dictionaries like so:

import yaml
dictionary = yaml.load(open(pathToFile))

You will most likely need to install yaml library; it is also quite easy to write a script that would read such serialized data.

Reference Materials:

https://beginnersbook.com/2015/04/json-tutorial/
More broadly: Baker, James. 2014. “Preserving Your Research Data.” Programming Historian, April. https://programminghistorian.org/lessons/preserving-your-research-data.

Homework:

reformatting the “Dispatch”: generate one TSV-file with the entire content of the “Dispatch”, where each article (or, more broadly—an entry) is a single record; each record should include:
1. date;
2. type of an entry (there are articles, advertisements, notices, etc);
3. header;
4. the text of an entry.
publish the description of this process (with excerpts of code) on your website.
Codecademy’s Learn Python, Unit 9-10.
Github: publish the confirmation screenshot as a post on your new site.

Homework solution

import re, os

source = "path_where_initial_files_are"
target = "path_to_save_new_files"

lof = os.listdir(source)
counter = 0 # general counter to keep track of the progress

ourCSV = []

for f in lof:
    if f.startswith("dltext"): # fileName test        
        with open(source + f, "r", encoding="utf8") as f1:
            text = f1.read()

            # try to find the date
            date = re.search(r'<date value="([\d-]+)"', text).group(1)

            # splitting the issue into articles/items
            split = re.split("<div3 ", text)

            c = 0 # item counter
            for s in split[1:]:
                c += 1
                s = "<div3 " + s # a step to restore the integrity of items
                #input(s)

                # try to find a unitType
                try:
                    unitType = re.search(r'type="([^\"]+)"', s).group(1)
                except:
                    unitType = "noType"
                    print(s)

                # try to find a header
                try:
                    header = re.search(r'<head>(.*)</head>', s).group(1)
                    header = re.sub("<[^<]+>", "", header)
                except:
                    header = "NO HEADER"
                    #print("No header found!")

                text = re.sub("<[^<]+>", "", s)
                text = re.sub(" +\n|\n +", "\n", text)
                text = re.sub("\n+", ";;; ", text)

                # generating necessary bits 
                fName = date+"_"+unitType+"_"+str(c)

                itemID = date+"_"+unitType+"_"+str(c)
                dateVar   = date
                #unitType = unitType
                #header = header
                text = text.replace("\t", " ")

                # creating a text variable
                var = "\t".join([itemID,dateVar,unitType,header,text])
                #input(var)

                ourCSV.append(var)


        # count processed issues and print progress counter at every 100        
        counter += 1
        if counter % 100 == 0:
            print(counter)

# saving
with open("dispatch_as_TSV.csv", "w", encoding="utf8") as f9:
    f9.write("\n".join(ourCSV))
print(counter)

Submitting homework:

Homework assignment must be submitted by the beginning of the next class;
Email your homework to the instructor.
- if your homework is to create a file, email it as an attachment
- if your homework is a blogpost on your website, email the link to your website and to the blogpost with your homework.
- In the subject of your email, please, add the following: 070112-LXX-HW-YourLastName-YourMatriculationNumber, where LXX is the lesson for which the homework is submitted, YourLastName is your last name, and YourMatriculationNumber is your matriculation number.