Search code examples
jsonpython-2.7pandasdata-cleaningto-json

How to convert pandas Series to desired JSON format?


I am having the following data on which I need to do apply aggregation function followed by groupby.

My data is as follows: data.csv

id,category,sub_category,count
0,x,sub1,10
1,x,sub2,20
2,x,sub2,10
3,y,sub3,30
4,y,sub3,5
5,y,sub4,15
6,z,sub5,20

Here I'm trying to get the count by sub-category wise. After that I need to store the result in JSON format. The following piece of code helps me in achieving that. test.py

import pandas as pd
df = pd.read_csv('data.csv')
sub_category_total = df['count'].groupby([df['category'], df['sub_category']]).sum()
print sub_category_total.reset_index().to_json(orient = "records")

The above code gives me the following format.

[{"category":"x","sub_category":"sub1","count":10},{"category":"x","sub_category":"sub2","count":30},{"category":"y","sub_category":"sub3","count":35},{"category":"y","sub_category":"sub4","count":15},{"category":"z","sub_category":"sub5","count":20}]

But, my desired format is as follows:

{
"x":[{
     "sub_category":"sub1",
     "count":10
     },
     {
     "sub_category":"sub2",
      "count":30}],
"y":[{
     "sub_category":"sub3",
     "count":35
     },
     {
     "sub_category":"sub4",
     "count":15}],
"z":[{
     "sub_category":"sub5",
      "count":20}]
}

By following the discussions @ How to convert pandas DataFrame result to user defined json format, I replaced the last 2 lines of test.py with,

g = df.groupby('category')[["sub_category","count"]].apply(lambda x: x.to_dict(orient='records'))
print g.to_json()

It gives me the following output.

{"x":[{"count":10,"sub_category":"sub1"},{"count":20,"sub_category":"sub2"},{"count":10,"sub_category":"sub2"}],"y":[{"count":30,"sub_category":"sub3"},{"count":5,"sub_category":"sub3"},{"count":15,"sub_category":"sub4"}],"z":[{"count":20,"sub_category":"sub5"}]}

Though the above result is somewhat similar to my desired format, I couldn't perform any aggregation function over here as it throws error saying 'numpy.int64' object has no attribute 'to_dict'. Hence, I end up getting all of the rows in the data file.

Can somebody help me in achieving the above JSON format?


Solution

  • I think you can first aggregate with sum, parameter as_index=False was added to groupby, so output is Dataframe df1 and then use other solution:

    df1 = (df.groupby(['category','sub_category'], as_index=False)['count'].sum())
    print (df1)
      category sub_category  count
    0        x         sub1     10
    1        x         sub2     30
    2        y         sub3     35
    3        y         sub4     15
    4        z         sub5     20
    
    g = df1.groupby('category')[["sub_category","count"]]
           .apply(lambda x: x.to_dict(orient='records'))
    
    print (g.to_json())
    
    {
        "x": [{
            "sub_category": "sub1",
            "count": 10
        }, {
            "sub_category": "sub2",
            "count": 30
        }],
        "y": [{
            "sub_category": "sub3",
            "count": 35
        }, {
            "sub_category": "sub4",
            "count": 15
        }],
        "z": [{
            "sub_category": "sub5",
            "count": 20
        }]
    }