Search code examples
pandasdataframedictionarymulti-index

Unpacking dictionary data to create a multiindex panda dataframe


I have a dataset including nested dictionaries that I would like to unpack and form a multiindex dataframe.

The dataframe should have columns for Years with sub columns for the Result Types in each year (yield and quality in this case).

Typical dataset:

datalist= [
    {
        'trial': 'efr1',
        'location': 'aberdeen',
        '2010': {'yield': '100', 'quality': '97'},
        '2011': {'yield': '90', 'quality': '87'},
        '2012': {'yield': '88', 'quality': '90'}
    
    },
    {
        'trial': 'efr2',
        'location': 'bristol',
        '2010': {'yield': '88', 'quality': '90'},
        '2011': {'yield': '75', 'quality': '82'},
        '2012': {'yield': '77', 'quality': '80'}
    
    },
    {
        'trial': 'axy1',
        'location': 'newcastle',
        '2010': {'yield': '91', 'quality': '95'},
        '2011': {'yield': '93', 'quality': '93'},
        '2012': {'yield': '75', 'quality': '97'}
    }   
]

Using Dataframe.from_dict() produces a table with embedded dictionaries elements which I cannot figure out how to unpack and split to subcolumns.

On the other hand using json_normalize() produces a flat table with compound headings, and I cannot figure out how to to convert that into a multiindex frame with the structure I need...

Results with Dataframe.from_dict() and json_normalize()


Solution

  • If I understand correctly, you could use pandas.json_normalize, then convert the year.type columns to a MultiIndex (with str.split):

    out = (
     pd.json_normalize(datalist)
       .set_index(['trial', 'location'])
       .pipe(lambda d: d.set_axis(d.columns.str.split('.', expand=True), axis=1))
    )
    

    Variant:

    out =  pd.json_normalize(datalist).set_index(['trial', 'location'])
    out.columns = out.columns.str.split('.', expand=True)
    

    Output:

                     2010          2011          2012        
                    yield quality yield quality yield quality
    trial location                                           
    efr1  aberdeen    100      97    90      87    88      90
    efr2  bristol      88      90    75      82    77      80
    axy1  newcastle    91      95    93      93    75      97
    

    Note that setting the first two columns as index is optional, but if you don't the MultiIndex will be:

      trial   location  2010          2011          2012        
        NaN        NaN yield quality yield quality yield quality
    0  efr1   aberdeen   100      97    90      87    88      90
    1  efr2    bristol    88      90    75      82    77      80
    2  axy1  newcastle    91      95    93      93    75      97
    

    Intermediate after pd.json_normalize(datalist):

      trial   location 2010.yield 2010.quality 2011.yield 2011.quality 2012.yield 2012.quality
    0  efr1   aberdeen        100           97         90           87         88           90
    1  efr2    bristol         88           90         75           82         77           80
    2  axy1  newcastle         91           95         93           93         75           97