Search code examples
pythonjsonpandasdictionarynested

Any depth nested dict to pandas dataframe


I've been fighting to go from a nested dictionary of depth D to a pandas DataFrame.

I've tried with recursive function, like the following one, but my problem is that when I'm iterating over a KEY, I don't know what was the pervious key.

I've also tried with json.normalize, pandas from dict but I always end up with dots in the columns...

Example code:

def iterate_dict(d, i = 2, cols = []):
    
    for k, v in d.items():
        # missing here how to check for the previous key
        # so that I can create an structure to create the dataframe.
        if type(v) is dict:   
            print('this is k: ', k)  
            if i % 2  == 0:
                cols.append(k)
            i+=1
            iterate_dict(v, i, cols)            
        else:
            print('this is k2: ' , k, ': ', v)

  
iterate_dict(test2)

This is an example of how my dictionary looks like:

# example 2 
test = {
    'column-gender': {
        'male': {
            'column-country' : {
                'FRENCH': {
                    'column-class': [0,1]
                },
                ('SPAIN','ITALY') : {
                    'column-married' : {
                        'YES': {
                            'column-class' : [0,1]
                        },
                        'NO' : {
                            'column-class' : 2
                        }
                    }
                }
            }
        },
        'female': {
            'column-country' : {
                ('FRENCH', 'SPAIN') : {
                    'column-class' : [[1,2],'#']
                },
                'REST-OF-VALUES': {
                    'column-married' : '*'
                }
            }
        }
    }
}

And this is how I want the dataframe to look like:

enter image description here

Any suggestion is welcome :)


Solution

  • If the column-keys are consistently prefixed with column-, you can create a recursive function:

    def data_to_df(data):
        rec_out = []
        def dict_to_rec(d, curr_row={}):
            for k, v in d.items():
                if 'column-' in k: # definition of a column
                    if isinstance(v, dict):
                        for val, nested_dict in v.items():
                            dict_to_rec(nested_dict, dict(curr_row, **{k[7:]: val}))
                    else:
                        rec_out.append(dict(curr_row, **{k[7:]: v}))
        dict_to_rec(data)
        return pd.DataFrame(rec_out)
    
    print(data_to_df(test))
    

    Edit: removing unnecessary variable and argument

    Output:

       gender          country        class married
    0    male           FRENCH       [0, 1]     NaN
    1    male   (SPAIN, ITALY)          YES  [0, 1]
    2    male   (SPAIN, ITALY)           NO       2
    3  female  (FRENCH, SPAIN)  [[1, 2], #]     NaN
    4  female   REST-OF-VALUES            *     NaN