Search code examples
pythonlistpandasdictionarydata-munging

Using Pandas to merge 2 list of dicts with common elements


So I have 2 list of dicts..

list_yearly = [
{'name':'john',
 'total_year': 107
},
{'name':'cathy',
 'total_year':124
},
]

list_monthly =  [
{'name':'john',
 'month':'Jan',
 'total_month': 34
},
{'name':'cathy',
 'month':'Jan',
 'total_month':78
},
{'name':'john',
 'month':'Feb',
 'total_month': 73
},
{'name':'cathy',
 'month':'Feb',
 'total_month':46
},
]

The goal is to get a final dataset which looks like this :

{'name':'john',
 'total_year': 107,
 'trend':[{'month':'Jan', 'total_month': 34},{'month':'Feb', 'total_month': 73}]
 },

 {'name':'cathy',
  'total_year':124,
  'trend':[{'month':'Jan', 'total_month': 78},{'month':'Feb', 'total_month': 46}]
  },

Since my dataset is for a large number of students for all the 12 months of a particular year,I am using Pandas for the data munging..This is how I went about :

First combine both the lists into a single dataframe using the name key.

In [5]: df = pd.DataFrame(list_yearly).merge(pd.DataFrame(list_monthly))

In [6]: df
Out[6]:
     name    total_year month  total_month
0   john         107     Jan           34
1   john         107     Feb           73
2  cathy         124     Jan           78
3  cathy         124     Feb           46

Then create a trend column as a dict

ln [7]: df['trend'] = df.apply(lambda x: [x[['month', 'total_month']].to_dict()], axis=1)

In [8]: df
Out[8]:
    name    total_year month  total_month  \
0   john         107   Jan           34
1   john         107   Feb           73
2  cathy         124   Jan           78
3  cathy         124   Feb           46

                                  trend
0  [{u'total_month': 34, u'month': u'Jan'}]
1  [{u'total_month': 73, u'month': u'Feb'}]
2  [{u'total_month': 78, u'month': u'Jan'}]
3  [{u'total_month': 46, u'month': u'Feb'}]

And, use to_dict(orient='records') method of selected columns to convert it back into list of dicts:

In [9]: df[['name', 'total_year', 'trend']].to_dict(orient='records')
Out[9]:
[{'name': 'john',
  'total_year': 107,
  'trend': [{'month': 'Jan', 'total_month': 34}]},
 {'name': 'john',
  'total_year': 107,
  'trend': [{'month': 'Feb', 'total_month': 73}]},
 {'name': 'cathy',
  'total_year': 124,
  'trend': [{'month': 'Jan', 'total_month': 78}]},
 {'name': 'cathy',
  'total_year': 124,
  'trend': [{'month': 'Feb', 'total_month': 46}]}]

As is evident,the final dataset is not exactly what I want.Instead of the 2 dicts with both the months in it,I instead get 4 dicts with all the months separate.How can i fix this ? I would prefer fixing it within Pandas itself rather than using this final output to again reduce it to the desired state


Solution

  • You should actually use groupby to group based on name and total_year instead of apply (as second step) and in the groupby you can create the list you want. Example -

    df = pd.DataFrame(list_yearly).merge(pd.DataFrame(list_monthly))
    
    def func(group):
        result = []
        for idx, row in group.iterrows():
            result.append({'month':row['month'],'total_month':row['total_month']})
        return result
    
    result = df.groupby(['name','total_year']).apply(func).reset_index()
    result.columns = ['name','total_year','trend']
    result_dict = result.to_dict(orient='records')
    

    Demo -

    In [9]: df = pd.DataFrame(list_yearly).merge(pd.DataFrame(list_monthly))
    
    In [10]: df
    Out[10]:
        name  total_year month  total_month
    0   john         107   Jan           34
    1   john         107   Feb           73
    2  cathy         124   Jan           78
    3  cathy         124   Feb           46
    
    In [13]: def func(group):
       ....:     result = []
       ....:     for idx, row in group.iterrows():
       ....:         result.append({'month':row['month'],'total_month':row['total_month']})
       ....:     return result
       ....:
    
    In [14]:
    
    In [14]: result = df.groupby(['name','total_year']).apply(func).reset_index()
    
    In [15]: result
    Out[15]:
        name  total_year                                                  0
    0  cathy         124  [{'month': 'Jan', 'total_month': 78}, {'month'...
    1   john         107  [{'month': 'Jan', 'total_month': 34}, {'month'...
    
    In [19]: result.columns = ['name','total_year','trend']
    
    In [20]: result
    Out[20]:
        name  total_year                                              trend
    0  cathy         124  [{'month': 'Jan', 'total_month': 78}, {'month'...
    1   john         107  [{'month': 'Jan', 'total_month': 34}, {'month'...
    
    In [21]: result.to_dict(orient='records')
    Out[21]:
    [{'name': 'cathy',
      'total_year': 124,
      'trend': [{'month': 'Jan', 'total_month': 78},
       {'month': 'Feb', 'total_month': 46}]},
     {'name': 'john',
      'total_year': 107,
      'trend': [{'month': 'Jan', 'total_month': 34},
       {'month': 'Feb', 'total_month': 73}]}]