python json python-3.x python-jsonschema

How to store the nested json which is in a list to a text file using Python?

I am creating a nested json and i am storing it in a list object. Here is my code which is getting the proper hierarchical json as intended.

Sample Data:

datasource,datasource_cnt,category,category_cnt,subcategory,subcategory_cnt Bureau of Labor Statistics,44,Employment and wages,44,Employment and wages,44

import pandas as pd
df=pd.read_csv('queryhive16273.csv')
def split_df(df):
   for (vendor, count), df_vendor in df.groupby(["datasource", "datasource_cnt"]):
       yield {
           "vendor_name": vendor,
           "count": count,
           "categories": list(split_category(df_vendor))
       }

def split_category(df_vendor):
   for (category, count), df_category in df_vendor.groupby(
       ["category", "category_cnt"]
   ):
       yield {
           "name": category,
           "count": count,
           "subCategories": list(split_subcategory(df_category)),
       }

def split_subcategory(df_category):
   for (subcategory, count), df_subcategory in df_category.groupby(
       ["subcategory", "subcategory_cnt"]
   ):
       yield {
           "count": count,
           "name": subcategory,
             }


abc=list(split_df(df))

abc is containing the data as shown below. This is the intended result.

[{
    'count': 44,
    'vendor_name': 'Bureau of Labor Statistics',
    'categories': [{
        'count': 44,
        'name': 'Employment and wages',
        'subCategories': [{
            'count': 44,
            'name': 'Employment and wages'
        }]
    }]
}]

Now I am trying to store it into a json file.

with open('your_file2.json', 'w') as f:
    for item in abc:
       f.write("%s\n" % item)
        #f.write(abc)

Here comes the issue. This writes data in this fashion( refer below) which is not a valid json format. If i try to use json dump, it gives "json serialize error"

Could you please help me out here.

{
    'count': 44,
    'vendor_name': 'Bureau of Labor Statistics',
    'categories': [{
        'count': 44,
        'name': 'Employment and wages',
        'subCategories': [{
            'count': 44,
            'name': 'Employment and wages'
        }]
    }]
}

Expected Result :

[{
    "count": 44,
    "vendor_name": "Bureau of Labor Statistics",
    "categories": [{
        "count": 44,
        "name": "Employment and wages",
        "subCategories": [{
            "count": 44,
            "name": "Employment and wages"
        }]
    }]
}]

Solution

Using your data and PSL json gives me:

TypeError: Object of type 'int64' is not JSON serializable

Which just means some numpy object is living in your nested structure and does not have an encode method to convert it for JSON serialization.

Forcing encode to use string conversion when it lacks in the object itself is enough to make your code works:

import io
d = io.StringIO("datasource,datasource_cnt,category,category_cnt,subcategory,subcategory_cnt\nBureau of Labor Statistics,44,Employment and wages,44,Employment and wages,44")
df=pd.read_csv(d)

abc=list(split_df(df))

import json
json.dumps(abc, default=str)

It returns a valid JSON (but with int converted into str):

'[{"vendor_name": "Bureau of Labor Statistics", "count": "44", "categories": [{"name": "Employment and wages", "count": "44", "subCategories": [{"count": "44", "name": "Employment and wages"}]}]}]'

If it does not suit your needs, then use a dedicated Encoder:

import numpy as np
class MyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.int64):
            return int(obj)
        return json.JSONEncoder.default(self, obj)

json.dumps(abc, cls=MyEncoder)

This returns the requested JSON:

'[{"vendor_name": "Bureau of Labor Statistics", "count": 44, "categories": [{"name": "Employment and wages", "count": 44, "subCategories": [{"count": 44, "name": "Employment and wages"}]}]}]'

Another option is to directly convert your data before encoding:

def split_category(df_vendor):
   for (category, count), df_category in df_vendor.groupby(
       ["category", "category_cnt"]
   ):
       yield {
           "name": category,
           "count": int(count), # Cast here before encoding
           "subCategories": list(split_subcategory(df_category)),
       }