python list pandas dictionary data-munging

Using Pandas to merge 2 list of dicts with common elements

So I have 2 list of dicts..

list_yearly = [
{'name':'john',
 'total_year': 107
},
{'name':'cathy',
 'total_year':124
},
]

list_monthly =  [
{'name':'john',
 'month':'Jan',
 'total_month': 34
},
{'name':'cathy',
 'month':'Jan',
 'total_month':78
},
{'name':'john',
 'month':'Feb',
 'total_month': 73
},
{'name':'cathy',
 'month':'Feb',
 'total_month':46
},
]

The goal is to get a final dataset which looks like this :

{'name':'john',
 'total_year': 107,
 'trend':[{'month':'Jan', 'total_month': 34},{'month':'Feb', 'total_month': 73}]
 },

 {'name':'cathy',
  'total_year':124,
  'trend':[{'month':'Jan', 'total_month': 78},{'month':'Feb', 'total_month': 46}]
  },

Since my dataset is for a large number of students for all the 12 months of a particular year,I am using Pandas for the data munging..This is how I went about :

First combine both the lists into a single dataframe using the name key.

In [5]: df = pd.DataFrame(list_yearly).merge(pd.DataFrame(list_monthly))

In [6]: df
Out[6]:
     name    total_year month  total_month
0   john         107     Jan           34
1   john         107     Feb           73
2  cathy         124     Jan           78
3  cathy         124     Feb           46

Then create a trend column as a dict

ln [7]: df['trend'] = df.apply(lambda x: [x[['month', 'total_month']].to_dict()], axis=1)

In [8]: df
Out[8]:
    name    total_year month  total_month  \
0   john         107   Jan           34
1   john         107   Feb           73
2  cathy         124   Jan           78
3  cathy         124   Feb           46

                                  trend
0  [{u'total_month': 34, u'month': u'Jan'}]
1  [{u'total_month': 73, u'month': u'Feb'}]
2  [{u'total_month': 78, u'month': u'Jan'}]
3  [{u'total_month': 46, u'month': u'Feb'}]

And, use to_dict(orient='records') method of selected columns to convert it back into list of dicts:

In [9]: df[['name', 'total_year', 'trend']].to_dict(orient='records')
Out[9]:
[{'name': 'john',
  'total_year': 107,
  'trend': [{'month': 'Jan', 'total_month': 34}]},
 {'name': 'john',
  'total_year': 107,
  'trend': [{'month': 'Feb', 'total_month': 73}]},
 {'name': 'cathy',
  'total_year': 124,
  'trend': [{'month': 'Jan', 'total_month': 78}]},
 {'name': 'cathy',
  'total_year': 124,
  'trend': [{'month': 'Feb', 'total_month': 46}]}]

As is evident,the final dataset is not exactly what I want.Instead of the 2 dicts with both the months in it,I instead get 4 dicts with all the months separate.How can i fix this ? I would prefer fixing it within Pandas itself rather than using this final output to again reduce it to the desired state

Solution

You should actually use groupby to group based on name and total_year instead of apply (as second step) and in the groupby you can create the list you want. Example -

df = pd.DataFrame(list_yearly).merge(pd.DataFrame(list_monthly))

def func(group):
    result = []
    for idx, row in group.iterrows():
        result.append({'month':row['month'],'total_month':row['total_month']})
    return result

result = df.groupby(['name','total_year']).apply(func).reset_index()
result.columns = ['name','total_year','trend']
result_dict = result.to_dict(orient='records')

Demo -

In [9]: df = pd.DataFrame(list_yearly).merge(pd.DataFrame(list_monthly))

In [10]: df
Out[10]:
    name  total_year month  total_month
0   john         107   Jan           34
1   john         107   Feb           73
2  cathy         124   Jan           78
3  cathy         124   Feb           46

In [13]: def func(group):
   ....:     result = []
   ....:     for idx, row in group.iterrows():
   ....:         result.append({'month':row['month'],'total_month':row['total_month']})
   ....:     return result
   ....:

In [14]:

In [14]: result = df.groupby(['name','total_year']).apply(func).reset_index()

In [15]: result
Out[15]:
    name  total_year                                                  0
0  cathy         124  [{'month': 'Jan', 'total_month': 78}, {'month'...
1   john         107  [{'month': 'Jan', 'total_month': 34}, {'month'...

In [19]: result.columns = ['name','total_year','trend']

In [20]: result
Out[20]:
    name  total_year                                              trend
0  cathy         124  [{'month': 'Jan', 'total_month': 78}, {'month'...
1   john         107  [{'month': 'Jan', 'total_month': 34}, {'month'...

In [21]: result.to_dict(orient='records')
Out[21]:
[{'name': 'cathy',
  'total_year': 124,
  'trend': [{'month': 'Jan', 'total_month': 78},
   {'month': 'Feb', 'total_month': 46}]},
 {'name': 'john',
  'total_year': 107,
  'trend': [{'month': 'Jan', 'total_month': 34},
   {'month': 'Feb', 'total_month': 73}]}]