
How to convert pandas dataframe to defaultdict (class, list) where one of the column values is used as keys?


In the given pandas dataframe:

df =

    contig       pos  PI_index  hapX_My_Sum  hapY_My_Sum  hapX_Sp_Sum
 0       2  16229767       726          0.0         12.0          3.5
 1       2  16229783       726          0.0         12.0          3.5
 3       2  16229880       726          0.0         12.0          2.0
 4       2  16230491       255         12.0          0.0          0.0
 5       2  16230503       255         12.0          0.0          0.0
 6       2  16232072       255         11.0          1.0          0.0
 7       2  16232072       255         11.0          1.0          0.0
 8       2  16232282      3353         11.0          1.0          0.0
 9       2  16232444      3353         11.0          1.0          0.0
10       2  16232444      3353         11.0          1.0          0.0

I want to convert this dataframe to a dictionary of dictionaries, i.e. a defaultdict(dict).

So, I did:

from collections import defaultdict
df_dict = df.to_dict('index')

print(df_dict)  # gives me
{0: {'hapY_My_Sum': 12.0, 'hapX_Sp_Sum': 3.5 .....}

This works, but instead of the default pandas index I want the PI_index values to be the keys of a defaultdict(<class 'dict'>), so that I can use them in downstream analyses.

The print output of the defaultdict should be like:

defaultdict(<class 'dict'>, {'726': {'contig': '2', 'hapX_My_Sum': ['0.0', '0.0', '0.0'], 'hapY_My_Sum': ['12.0', '12.0', '12.0'], ....}, '255':{'contig': '2', 'hapX_My_Sum': [....]....}})

Post edit:

  • I forgot to add: is there a way to leave certain columns out of the dictionary if they are not needed, without dropping them from the pandas dataframe?
  • Also, what if I only want a single value for contig, since they will all be the same?

So, downstream I can do something like:

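from functools import reduce
from operator import mul

for k in df_dict:
    contig = df_dict[k]['contig']  # the column is named 'contig' in df, not 'chr'

    hapX_My_product = reduce(mul, (float(x) for x in df_dict[k]['hapX_My_Sum']))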

Solution

  • Is that what you want?

    In [11]: cols = ['contig','PI_index','hapX_My_Sum']
    
    In [12]: df[cols].groupby('PI_index') \
                     .apply(lambda x: x.set_index('PI_index').to_dict('list')) \
                     .to_dict()
    Out[12]:
    {255: {'contig': [2, 2, 2, 2], 'hapX_My_Sum': [12.0, 12.0, 11.0, 11.0]},
     726: {'contig': [2, 2, 2], 'hapX_My_Sum': [0.0, 0.0, 0.0]},
     3353: {'contig': [2, 2, 2], 'hapX_My_Sum': [11.0, 11.0, 11.0]}}
    

    Some explanation:

    First, we generate a dictionary for each group:

    In [87]: df[cols].groupby('PI_index') \
        ...:         .apply(lambda x: x.set_index('PI_index').to_dict('list'))
    Out[87]:
    PI_index
    255     {'contig': [2, 2, 2, 2], 'hapX_My_Sum': [12.0,...
    726     {'contig': [2, 2, 2], 'hapX_My_Sum': [0.0, 0.0...
    3353    {'contig': [2, 2, 2], 'hapX_My_Sum': [11.0, 11...
    dtype: object
    

    Then we convert the resulting Series to a dictionary with .to_dict(), so that the PI_index values (the Series index) become the keys:

    In [88]: df[cols].groupby('PI_index') \
        ...:         .apply(lambda x: x.set_index('PI_index').to_dict('list')) \
        ...:         .to_dict()
    Out[88]:
    {255: {'contig': [2, 2, 2, 2], 'hapX_My_Sum': [12.0, 12.0, 11.0, 11.0]},
     726: {'contig': [2, 2, 2], 'hapX_My_Sum': [0.0, 0.0, 0.0]},
     3353: {'contig': [2, 2, 2], 'hapX_My_Sum': [11.0, 11.0, 11.0]}}
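
For the two follow-up points in the question (leaving out some columns without dropping them, and keeping a single contig value per PI_index), here is a minimal sketch that builds the defaultdict with an explicit loop over the groups. It is one possible approach, not the only one. The names `list_cols`, `entry` and `products` are made up for illustration, the sample frame only reproduces part of the question's data, and the code assumes that contig really is constant within each PI_index group. Values keep their numeric types rather than being converted to strings as in the question's expected output.

```python
from collections import defaultdict
from functools import reduce
from operator import mul

import pandas as pd

# Sample rows taken from the question's dataframe (subset of columns).
df = pd.DataFrame({
    "contig": [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    "pos": [16229767, 16229783, 16229880, 16230491, 16230503,
            16232072, 16232072, 16232282, 16232444, 16232444],
    "PI_index": [726, 726, 726, 255, 255, 255, 255, 3353, 3353, 3353],
    "hapX_My_Sum": [0.0, 0.0, 0.0, 12.0, 12.0, 11.0, 11.0, 11.0, 11.0, 11.0],
    "hapY_My_Sum": [12.0, 12.0, 12.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0],
})

# Hypothetical selection: only these columns are exported as lists.
# Nothing is dropped from df itself.
list_cols = ["hapX_My_Sum", "hapY_My_Sum"]

df_dict = defaultdict(dict)
# sort=False keeps the groups in order of first appearance.
for pi_index, group in df.groupby("PI_index", sort=False):
    entry = df_dict[pi_index]
    # Assumption: contig is the same for every row of a group,
    # so the first value is kept as a scalar.
    entry["contig"] = group["contig"].iloc[0]
    for col in list_cols:
        entry[col] = group[col].tolist()

# Downstream use, as in the question: product of hapX_My_Sum per PI_index.
products = {k: reduce(mul, v["hapX_My_Sum"]) for k, v in df_dict.items()}
```

If contig can vary within a group, taking `.iloc[0]` would silently hide that, so it may be worth checking `group["contig"].nunique() == 1` first.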