Search code examples
pythonjsonpandasdataframedata-cleaning

json.load is showing error whereas read_json is not splitting data in columns accordingly


Hi I am trying to clean up data but having trouble reading json file as separate dataframe column. I have thousands of records like this in a file:

{"hotel_class": 4.0, 
"region_id": 60763,
"url": "http://www.tripadvisor.com/Hotel_Review-g60763-d113317-Reviews Casablanca_Hotel_Times_Square-New_York_City_New_York.html", 
"phone": "", 
"details": null,
"address": {"region": "NY", "street-address": "147 West 43rd Street", "postal-code": "10036", "locality": "New York City"},
"type": "hotel",
"id": 113317,
"name": "Casablanca Hotel Times Square"}

i tried to load it as:

with open('offering.txt') as datafile:
  data_json = json.load(datafile)

but it is giving an error i.e

JSONDecodeError: Extra data: line 2 column 1 (char 398)

so i tried doing it with

df=pd.read_json('offering.txt',lines=True)

but if i do it this way, my address column has nested values and i want to separate them in different columns. how to do it?

df['address']

0       {'region': 'NY', 'street-address': '147 West 4...
1       {'region': 'CA', 'street-address': '300 S Dohe...
2       {'region': 'NY', 'street-address': '790 Eighth...
3       {'region': 'NY', 'street-address': '152 West 5...
4       {'region': 'NY', 'street-address': '130 West 4...

Name: address, Length: 4333, dtype: object

Solution

  • Try:

    df = pd.read_json("offering.txt", lines=True)
    
    df_out = pd.concat([df, df.pop("address").apply(pd.Series)], axis=1)
    print(df_out)
    

    Prints:

       hotel_class  region_id                                                                                                                       url phone  details   type      id                           name region        street-address postal-code       locality
    0            4      60763  http://www.tripadvisor.com/Hotel_Review-g60763-d113317-Reviews Casablanca_Hotel_Times_Square-New_York_City_New_York.html            NaN  hotel  113317  Casablanca Hotel Times Square     NY  147 West 43rd Street       10036  New York City
    1            5      60763  http://www.tripadvisor.com/Hotel_Review-g60763-d113317-Reviews Casablanca_Hotel_Times_Square-New_York_City_New_York.html            NaN  hotel  113317  Casablanca Hotel Times Square     CA  147 West 43rd Street       10036  New York City
    
    ...