Thank you in advance for your help.
My python code reads json input file and loads the data into a data frame, masks or changes on the data frame column specified by configuration and in the last stage, creates json output file.
read json into data frame --> mask/change the df column ---> generate json
Input json:
[
{
"BinLogFilename": "mysql.log",
"Type": "UPDATE",
"Table": "users",
"ServerId": 1,
"BinLogPosition": 2111
}, {
{ "BinLogFilename": "mysql.log",
"Type": "UPDATE",
"Table": "users",
"ServerId": null,
"BinLogPosition": 2111
},
...
]
when I load the above json into data frame, the data frame column "ServerId" has float values because it has null in few blocks of json input.
The main central logic converts/fakes "ServerId" into another number, however the output contains float numbers.
Output json:
[
{
"BinLogFilename": "mysql.log",
"Type": "UPDATE",
"Table": "users",
"ServerId": 5627.0,
"BinLogPosition": 2111
},
{
"BinLogFilename": "mysql.log",
"Type": "UPDATE",
"Table": "users",
"ServerId": null,
"BinLogPosition": 2111
},
....
]
masking logic
df['ServerId'] = [fake.pyint() if not(pd.isna(df['ServerId'][index])) else np.nan for index in range(len(df['ServerId']))]
The challenge is, the output "ServerId" should contain only integers but unfortunately it contains floats.
df['ServerId']
0 9590.0
1 NaN
2 1779.0
3 1303.0
4 NaN
I found a answer to this problem, to use 'Int64'
df['ServerId'] = df['ServerId'].astype('Int64')
0 8920
1 <NA>
2 9148
3 2434
4 <NA>
However using 'Int64', it converts NaN to NA and while writing back to json, i get an error as,
TypeError: Object of type NAType is not JSON serializable
with gzip.open(outputFile, 'w') as outfile:
outfile.write(json.dumps(json_objects_list).encode('utf-8'))
Is it possible to keep NaN after converting to 'Int64' data type? If this is not possible, how can i fix the error?
Indeed, Pandas NA and NaT are not JSON serialisable by the built-in Python json library.
But the Pandas DataFrame to_json() method will handle those values for you and convert them to JSON null.
from pandas import DataFrame, Series, NA, NaT
df = DataFrame({"ServerId" : Series([8920, NA, 9148, 2434, NA], dtype="Int64") })
s = df.to_json()
# -> {"ServerId":{"0":8920,"1":null,"2":9148,"3":2434,"4":null}}