Search code examples
pythonpandasjson-normalize

Pandas JSON Normalize - Choose Correct Record Path


I am trying to figure out how to normalize the nested JSON response sampled below.

Right now, json_normalize(res,record_path=['data']) is giving me MOST of the data I need but what I would really like is the detail in the "session_pageviews" list/dict with the attributes of the data list/dic included.

I tried json_normalize(res,record_path=['data', ['session_pageviews']], meta = ['data']) but I get an error: ValueError: operands could not be broadcast together with shape (32400,) (180,)

I also tried json_normalize(res,record_path=['data'], max_level = 1) but that does not unnest session_pageviews

Any help would be appreciated!

enter image description here


Solution

  • You can try to apply the following function to your json:

    def flatten_nested_json_df(df):
        df = df.reset_index()
        s = (df.applymap(type) == list).all()
        list_columns = s[s].index.tolist()
        
        s = (df.applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()
    
        
        while len(list_columns) > 0 or len(dict_columns) > 0:
            new_columns = []
    
            for col in dict_columns:
                horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
                horiz_exploded.index = df.index
                df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
                new_columns.extend(horiz_exploded.columns) # inplace
    
            for col in list_columns:
                #print(f"exploding: {col}")
                df = df.drop(columns=[col]).join(df[col].explode().to_frame())
                new_columns.append(col)
    
            s = (df[new_columns].applymap(type) == list).all()
            list_columns = s[s].index.tolist()
    
            s = (df[new_columns].applymap(type) == dict).all()
            dict_columns = s[s].index.tolist()
        return df
    

    by doing this:

    df1= flatten_nested_json_df(df)
    

    where

    df = pd.json_normalize(json)
    

    That should give you all the information contained in your json.