Search code examples
pythonjsonpython-2.7pyparsing

Pandas semi structured JSON data frame to simple Pandas dataframe


I have a blob of data taken from a redshift cluster. the first 4 columns are separated by '|' then the 2 columns are JSON.

XXX|ABANDONED|1197|11|"{""currency"":""EUR""    item_id"":""143""   type"":""FLIGHT""   name"":""PAR-FEZ""  price"":1111    origin"":""PAR""    destination"":""FEZ""   merchant"":""GOV""  flight_type"":""OW""    flight_segment"":[{ origin"":""ORY""    destination"":""FEZ""   departure_date_time"":""2015-08-02T07:20""  arrival_date_time"":""2015-08-02T09:05""    carrier"":""AT""    f_class"":""ECONOMY""}]}"|"{""type"":""FLIGHT"" name"":""FI_ORY-OUD""   item_id"":""FLIGHT""    currency"":""EUR""  price"":111 origin"":""ORY""    destination"":""OUD""   flight_type"":""OW""    flight_segment"":[{""origin"":""ORY""   destination"":""OUD""   departure_date_time"":""2015-08-02T13:55""  arrival_date_time"":""2015-08-02T15:30""    flight_number"":""AT625""   carrier"":""AT""    f_class"":""ECONOMIC_DISCOUNTED""}]}"   

Working in Python 2.7 would like to separate out the JSON values and convert it to a Pandas dataframe but I am to inexperienced in pyparsing to do so.

My approach was to read in the file as a Pandas data frame with '|' as the separator, than take the columns containing the JSON and flatten it using 'JSON_normalise' but JSON_normalise won't index over the panda's column

I have discovered solutions here and here but one is unsuitable to my 'mixed data' and the other is to simplistic for what is a rather large JSON file

Any tips on how to deploy Pyparsing on this data would be very helpful. Thanks

Pyparsing: Parsing semi-JSON nested plaintext data to a list

Parsing semi-structured json data(Python/R)


Solution

  • Taking your input string above as a variable named 'data', this Python+pyparsing code will make some sense of it. Unfortunately, that stuff to the right of the fourth '|' isn't really JSON. Fortunately, it is well enough formatted that it can be parsed without undue discomfort. See the embedded comments in the program below:

    from pyparsing import *
    from datetime import datetime
    
    # for the most part, we suppress punctuation - it's important at parse time
    # but just gets in the way afterwards
    LBRACE,RBRACE,COLON,DBLQ,LBRACK,RBRACK = map(Suppress, '{}:"[]')
    DBLQ2 = DBLQ + DBLQ
    
    # define some scalar value expressions, including parse-time conversion parse actions
    realnum = Regex(r'[+-]?\d+\.\d*').setParseAction(lambda t:float(t[0]))
    integer = Regex(r'[+-]?\d+').setParseAction(lambda t:int(t[0]))
    timestamp = Regex(r'""\d{4}-\d{2}-\d{2}T\d{2}:\d{2}""')
    timestamp.setParseAction(lambda t: datetime.strptime(t[0][2:-2],'%Y-%m-%dT%H:%M'))
    string_value = QuotedString('""')
    
    # define our base key ':' value expression; use a Forward() placeholder
    # for now for value, since these things can be recursive
    key = Optional(DBLQ2) + Word(alphas, alphanums+'_') + DBLQ2
    value = Forward()
    key_value = Group(key + COLON + value)
    
    # objects can be values too - use the Dict class to capture keys as field names
    obj = Group(Dict(LBRACE + OneOrMore(key_value) + RBRACE))
    objlist = (LBRACK + ZeroOrMore(obj) + RBRACK)
    
    # define expression for previously-declared value, using <<= operator
    value <<= timestamp | string_value | realnum | integer | obj | Group(objlist)
    
    # the outermost objects are enclosed in "s, and list of them can be given with '|' delims
    quotedObj = DBLQ + obj + DBLQ
    obsList = delimitedList(quotedObj, delim='|')
    

    Now apply that parser to your 'data':

    fields = data.split('|',4)
    result = obsList.parseString(fields[-1])
    
    # we get back a list of objects, dump them out
    for r in result:
        print r.dump()
        print
    

    Gives:

    [['currency', 'EUR'], ['item_id', '143'], ['type', 'FLIGHT'], ['name', 'PAR-FEZ'], ['price', 1111], ['origin', 'PAR'], ['destination', 'FEZ'], ['merchant', 'GOV'], ['flight_type', 'OW'], ['flight_segment', [[['origin', 'ORY'], ['destination', 'FEZ'], ['departure_date_time', datetime.datetime(2015, 8, 2, 7, 20)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 9, 5)], ['carrier', 'AT'], ['f_class', 'ECONOMY']]]]]
    - currency: EUR
    - destination: FEZ
    - flight_segment: 
      [0]:
        [['origin', 'ORY'], ['destination', 'FEZ'], ['departure_date_time', datetime.datetime(2015, 8, 2, 7, 20)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 9, 5)], ['carrier', 'AT'], ['f_class', 'ECONOMY']]
        - arrival_date_time: 2015-08-02 09:05:00
        - carrier: AT
        - departure_date_time: 2015-08-02 07:20:00
        - destination: FEZ
        - f_class: ECONOMY
        - origin: ORY
    - flight_type: OW
    - item_id: 143
    - merchant: GOV
    - name: PAR-FEZ
    - origin: PAR
    - price: 1111
    - type: FLIGHT
    
    [['type', 'FLIGHT'], ['name', 'FI_ORY-OUD'], ['item_id', 'FLIGHT'], ['currency', 'EUR'], ['price', 111], ['origin', 'ORY'], ['destination', 'OUD'], ['flight_type', 'OW'], ['flight_segment', [[['origin', 'ORY'], ['destination', 'OUD'], ['departure_date_time', datetime.datetime(2015, 8, 2, 13, 55)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 15, 30)], ['flight_number', 'AT625'], ['carrier', 'AT'], ['f_class', 'ECONOMIC_DISCOUNTED']]]]]
    - currency: EUR
    - destination: OUD
    - flight_segment: 
      [0]:
        [['origin', 'ORY'], ['destination', 'OUD'], ['departure_date_time', datetime.datetime(2015, 8, 2, 13, 55)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 15, 30)], ['flight_number', 'AT625'], ['carrier', 'AT'], ['f_class', 'ECONOMIC_DISCOUNTED']]
        - arrival_date_time: 2015-08-02 15:30:00
        - carrier: AT
        - departure_date_time: 2015-08-02 13:55:00
        - destination: OUD
        - f_class: ECONOMIC_DISCOUNTED
        - flight_number: AT625
        - origin: ORY
    - flight_type: OW
    - item_id: FLIGHT
    - name: FI_ORY-OUD
    - origin: ORY
    - price: 111
    - type: FLIGHT
    

    Note that the values that are not strings (integers, timestamps, etc.) have already been converted to Python types. Since the field names were saved as dict keys, you can access the fields by name as in:

    res[0].currency
    res[0].price
    res[0].destination
    res[0].flight_segment[0].origin
    len(res[0].flight_segment) # gives how many segments