python pandas dataframe numpy python-3.8

TypeError: Object of type NAType is not JSON serializable

Thank you in advance for your help.

My python code reads json input file and loads the data into a data frame, masks or changes on the data frame column specified by configuration and in the last stage, creates json output file.

read json into data frame --> mask/change the df column ---> generate json

Input json:

[
    {
        "BinLogFilename": "mysql.log",
        "Type": "UPDATE",
        "Table": "users",
        "ServerId": 1,
        "BinLogPosition": 2111
    },        {
    {   "BinLogFilename": "mysql.log",
        "Type": "UPDATE",
        "Table": "users",
        "ServerId": null,
        "BinLogPosition": 2111
    },
  ...
]

when I load the above json into data frame, the data frame column "ServerId" has float values because it has null in few blocks of json input.

The main central logic converts/fakes "ServerId" into another number, however the output contains float numbers.

Output json:

[
      {
            "BinLogFilename": "mysql.log",
            "Type": "UPDATE",
            "Table": "users",
            "ServerId": 5627.0,
            "BinLogPosition": 2111
        }, 
        {
            "BinLogFilename": "mysql.log",
            "Type": "UPDATE",
            "Table": "users",
            "ServerId": null,
            "BinLogPosition": 2111
        },
     ....
]

masking logic

df['ServerId'] = [fake.pyint() if not(pd.isna(df['ServerId'][index])) else np.nan for index in range(len(df['ServerId']))]

The challenge is, the output "ServerId" should contain only integers but unfortunately it contains floats.

df['ServerId']
0     9590.0
1        NaN
2     1779.0
3     1303.0
4        NaN

I found a answer to this problem, to use 'Int64'

df['ServerId'] = df['ServerId'].astype('Int64')
0     8920
1     <NA>
2     9148
3     2434
4     <NA>

However using 'Int64', it converts NaN to NA and while writing back to json, i get an error as,

TypeError: Object of type NAType is not JSON serializable

with gzip.open(outputFile, 'w') as outfile:
    outfile.write(json.dumps(json_objects_list).encode('utf-8'))

Is it possible to keep NaN after converting to 'Int64' data type? If this is not possible, how can i fix the error?

Solution

Indeed, Pandas NA and NaT are not JSON serialisable by the built-in Python json library.

But the Pandas DataFrame to_json() method will handle those values for you and convert them to JSON null.

from pandas import DataFrame, Series, NA, NaT

df = DataFrame({"ServerId" : Series([8920, NA, 9148, 2434, NA], dtype="Int64") })
s = df.to_json()

# -> {"ServerId":{"0":8920,"1":null,"2":9148,"3":2434,"4":null}}