This issue is about exporting a dataframe with a float64 column using the to_json() function from pandas. The source code is attached below.
import pandas

if __name__ == "__main__":
    d = {'col1': [11111111.84, 123456.55], 'col2': [3, 4]}
    df = pandas.DataFrame(data=d)
    print(df)
    print(df.dtypes)

    output_file_path = '/test.csv'
    df.to_csv(output_file_path, index=False, encoding='UTF-8')

    output_file_path = '/test.json'
    df.to_json(output_file_path, orient="records", lines=True)
The output from the print() function is correct before exporting the dataframe to a JSON or CSV file. The output is shown below.
col1 col2
0 11111111.84 3
1 123456.55 4
col1 float64
col2 int64
dtype: object
The exported data in CSV format (test.csv) is correct, as it should be.
The exported data in JSON format (test.json) has incorrect decimal digits: the first row of col1 becomes 11111111.8399999999. The issue only occurs for some values, because the second row of col1 is exported correctly (123456.55).
I found a workaround for this issue: specifying the additional argument double_precision for the to_json() function. The result becomes correct! (Already tested.)
Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html
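The call I tested looks roughly like this (double_precision=2 is just an example value here, not a recommendation):

# workaround: limit the number of decimal digits that to_json() writes
df.to_json('/test.json', orient="records", lines=True, double_precision=2)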
However, the double_precision argument applies the same limit to all columns. That is not a good approach when each data column requires a different number of decimal digits.
I also found the topic below, but I am not sure whether it is related to my issue.
Link: What is the difference between NUMERIC and FLOAT in BigQuery?
I am trying to understand the root cause of this issue and to find a solution. It is quite weird that the issue only happens with the to_json() function, while the to_csv() function works fine.
Anyone please help!
pandas' to_json might be doing something weird with the precision there. As you've explained, the canonical solution is to specify double_precision, but that applies one precision globally and doesn't let you selectively round specific columns.
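To make that limitation concrete, here is a sketch with a hypothetical extra column col3 that needs four decimal digits (col3 and the commented output are my illustration, not from your data):

import pandas

# hypothetical second float column that needs 4 decimal digits
df2 = pandas.DataFrame({'col1': [11111111.84], 'col3': [0.1234]})

# double_precision applies to every column, so col3 loses digits too;
# this should print something like: [{"col1":11111111.84,"col3":0.12}]
print(df2.to_json(orient="records", double_precision=2))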
Another option is to cut out the middleman df.to_json here and instead use Python's builtin json.dump:
import json

# convert to a string
json.dumps(df.to_dict())
# '{"col1": {"0": 11111111.84, "1": 123456.55}, "col2": {"0": 3, "1": 4}}'

# save to a file (f is an open file object)
with open('/test.json', 'w') as f:
    json.dump(df.to_dict(), f)
As you can see, this doesn't muck around with the precision. Standard floating point caveats still apply.
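As for the root cause: 11111111.84 has no exact float64 representation. to_json writes floats with a fixed number of digits (double_precision, 10 by default), which exposes the tail of the stored value, while the json module (and to_csv) serialize floats with repr(), the shortest string that round-trips to the same value. And if you do need per-column precision, DataFrame.round accepts a column-to-decimals mapping, so you can round selectively before dumping. A sketch (I haven't run this against your exact setup):

import json
import pandas
from decimal import Decimal

df = pandas.DataFrame({'col1': [11111111.84, 123456.55], 'col2': [3, 4]})

# the exact value stored in the float64 is slightly below 11111111.84,
# which is where to_json's ...8399999999 tail comes from
print(Decimal(11111111.84))

# per-column precision: round col1 to 2 decimals (col2 untouched),
# then dump with the stdlib json module, which uses repr()
rounded = df.round({'col1': 2})
with open('/test.json', 'w') as f:
    json.dump(rounded.to_dict(), f)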