Pandas semi structured JSON data frame to simple Pandas dataframe

I have a blob of data taken from a redshift cluster. the first 4 columns are separated by '|' then the 2 columns are JSON.

XXX|ABANDONED|1197|11|"{""currency"":""EUR""    item_id"":""143""   type"":""FLIGHT""   name"":""PAR-FEZ""  price"":1111    origin"":""PAR""    destination"":""FEZ""   merchant"":""GOV""  flight_type"":""OW""    flight_segment"":[{ origin"":""ORY""    destination"":""FEZ""   departure_date_time"":""2015-08-02T07:20""  arrival_date_time"":""2015-08-02T09:05""    carrier"":""AT""    f_class"":""ECONOMY""}]}"|"{""type"":""FLIGHT"" name"":""FI_ORY-OUD""   item_id"":""FLIGHT""    currency"":""EUR""  price"":111 origin"":""ORY""    destination"":""OUD""   flight_type"":""OW""    flight_segment"":[{""origin"":""ORY""   destination"":""OUD""   departure_date_time"":""2015-08-02T13:55""  arrival_date_time"":""2015-08-02T15:30""    flight_number"":""AT625""   carrier"":""AT""    f_class"":""ECONOMIC_DISCOUNTED""}]}"

Working in Python 2.7 would like to separate out the JSON values and convert it to a Pandas dataframe but I am to inexperienced in pyparsing to do so.

My approach was to read in the file as a Pandas data frame with '|' as the separator, than take the columns containing the JSON and flatten it using 'JSON_normalise' but JSON_normalise won't index over the panda's column

I have discovered solutions here and here but one is unsuitable to my 'mixed data' and the other is to simplistic for what is a rather large JSON file

Any tips on how to deploy Pyparsing on this data would be very helpful. Thanks

Pyparsing: Parsing semi-JSON nested plaintext data to a list

Parsing semi-structured json data(Python/R)

Solution

Taking your input string above as a variable named 'data', this Python+pyparsing code will make some sense of it. Unfortunately, that stuff to the right of the fourth '|' isn't really JSON. Fortunately, it is well enough formatted that it can be parsed without undue discomfort. See the embedded comments in the program below:

from pyparsing import *
from datetime import datetime

# for the most part, we suppress punctuation - it's important at parse time
# but just gets in the way afterwards
LBRACE,RBRACE,COLON,DBLQ,LBRACK,RBRACK = map(Suppress, '{}:"[]')
DBLQ2 = DBLQ + DBLQ

# define some scalar value expressions, including parse-time conversion parse actions
realnum = Regex(r'[+-]?\d+\.\d*').setParseAction(lambda t:float(t[0]))
integer = Regex(r'[+-]?\d+').setParseAction(lambda t:int(t[0]))
timestamp = Regex(r'""\d{4}-\d{2}-\d{2}T\d{2}:\d{2}""')
timestamp.setParseAction(lambda t: datetime.strptime(t[0][2:-2],'%Y-%m-%dT%H:%M'))
string_value = QuotedString('""')

# define our base key ':' value expression; use a Forward() placeholder
# for now for value, since these things can be recursive
key = Optional(DBLQ2) + Word(alphas, alphanums+'_') + DBLQ2
value = Forward()
key_value = Group(key + COLON + value)

# objects can be values too - use the Dict class to capture keys as field names
obj = Group(Dict(LBRACE + OneOrMore(key_value) + RBRACE))
objlist = (LBRACK + ZeroOrMore(obj) + RBRACK)

# define expression for previously-declared value, using <<= operator
value <<= timestamp | string_value | realnum | integer | obj | Group(objlist)

# the outermost objects are enclosed in "s, and list of them can be given with '|' delims
quotedObj = DBLQ + obj + DBLQ
obsList = delimitedList(quotedObj, delim='|')

Now apply that parser to your 'data':

fields = data.split('|',4)
result = obsList.parseString(fields[-1])

# we get back a list of objects, dump them out
for r in result:
    print r.dump()
    print

Gives:

[['currency', 'EUR'], ['item_id', '143'], ['type', 'FLIGHT'], ['name', 'PAR-FEZ'], ['price', 1111], ['origin', 'PAR'], ['destination', 'FEZ'], ['merchant', 'GOV'], ['flight_type', 'OW'], ['flight_segment', [[['origin', 'ORY'], ['destination', 'FEZ'], ['departure_date_time', datetime.datetime(2015, 8, 2, 7, 20)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 9, 5)], ['carrier', 'AT'], ['f_class', 'ECONOMY']]]]]
- currency: EUR
- destination: FEZ
- flight_segment: 
  [0]:
    [['origin', 'ORY'], ['destination', 'FEZ'], ['departure_date_time', datetime.datetime(2015, 8, 2, 7, 20)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 9, 5)], ['carrier', 'AT'], ['f_class', 'ECONOMY']]
    - arrival_date_time: 2015-08-02 09:05:00
    - carrier: AT
    - departure_date_time: 2015-08-02 07:20:00
    - destination: FEZ
    - f_class: ECONOMY
    - origin: ORY
- flight_type: OW
- item_id: 143
- merchant: GOV
- name: PAR-FEZ
- origin: PAR
- price: 1111
- type: FLIGHT

[['type', 'FLIGHT'], ['name', 'FI_ORY-OUD'], ['item_id', 'FLIGHT'], ['currency', 'EUR'], ['price', 111], ['origin', 'ORY'], ['destination', 'OUD'], ['flight_type', 'OW'], ['flight_segment', [[['origin', 'ORY'], ['destination', 'OUD'], ['departure_date_time', datetime.datetime(2015, 8, 2, 13, 55)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 15, 30)], ['flight_number', 'AT625'], ['carrier', 'AT'], ['f_class', 'ECONOMIC_DISCOUNTED']]]]]
- currency: EUR
- destination: OUD
- flight_segment: 
  [0]:
    [['origin', 'ORY'], ['destination', 'OUD'], ['departure_date_time', datetime.datetime(2015, 8, 2, 13, 55)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 15, 30)], ['flight_number', 'AT625'], ['carrier', 'AT'], ['f_class', 'ECONOMIC_DISCOUNTED']]
    - arrival_date_time: 2015-08-02 15:30:00
    - carrier: AT
    - departure_date_time: 2015-08-02 13:55:00
    - destination: OUD
    - f_class: ECONOMIC_DISCOUNTED
    - flight_number: AT625
    - origin: ORY
- flight_type: OW
- item_id: FLIGHT
- name: FI_ORY-OUD
- origin: ORY
- price: 111
- type: FLIGHT

Note that the values that are not strings (integers, timestamps, etc.) have already been converted to Python types. Since the field names were saved as dict keys, you can access the fields by name as in:

res[0].currency
res[0].price
res[0].destination
res[0].flight_segment[0].origin
len(res[0].flight_segment) # gives how many segments