Search code examples
pythonjsonpandasdataframejson-normalize

(Pandas/Dataframe) pandas.json_normalize on nested JSON data without uniform record_path


I'm attempting to convert a large JSON file to a CSV, but the field that I need to be able to sort data on in the Spreadsheet is all in one cell whenever I convert it to CSV/Normalize the JSON. The main thing I need is the hits list of dictionaries not all be in the same cell when I convert it to a csv. (Structure is: a Dictionary of Dictionaries which contains a List of Dictionaries)

Here's an example of what the JSON would look like: https://pastebin.com/VA5mfhfB

Here's how I've tried doing it (and what gives somewhat of an output):

df = pd.json_normalize(boss_dictionary)
df.to_csv(r'data.csv', index=None)

I've tried putting a record_path parameter, but because there isn't a "uniform" boss_id (the slew of numbers beforehand), I can't figure out how to normalize the hits list of dictionaries.

Another thing that I've tried:

df = pd.read_json('data.json')
df.to_csv(r'data.csv', index=None)

Which does something similar to what I need, but not what I actually need. The hit list is just in one cell instead of being normalized out.

What I've tried to fix it:

I've tried to normalize it with the dictionary itself, and read it from JSON. I've read the documentation on json_normalize, but no parameters of meta or record_path netted me any result that didn't raise an exception.


Solution

  • Using json_normalize with in a list comp based off keys. Finally merge and explode.

    from ast import literal_eval
    
    import pandas as pd
    
    
    data = literal_eval(open("/path/to/file/data.txt").read())
    
    df_meta = (
        pd
        .concat([pd.json_normalize(data=data[x]) for x in data], keys=data.keys())
        .droplevel(level=1)
        .reset_index(names="id")
    )
    
    df_records = (
        pd
        .concat([pd.json_normalize(data=data[x], record_path=["hits"]) for x in data], keys=data.keys())
        .droplevel(level=1)
        .reset_index(names="id")
    )
    
    df_final = pd.merge(left=df_meta, right=df_records).drop(columns="hits")
    df_final = df_final.explode("hp_list").reset_index(drop=True)