This issue is about exporting a dataframe with a float64 column using the to_json() function from pandas. The source code is attached below.
import pandas

if __name__ == "__main__":
    d = {'col1': [11111111.84, 123456.55], 'col2': [3, 4]}
    df = pandas.DataFrame(data=d)
    print(df)
    print(df.dtypes)

    output_file_path = '/test.csv'
    df.to_csv(output_file_path, index=False, encoding='UTF-8')

    output_file_path = '/test.json'
    df.to_json(output_file_path, orient="records", lines=True)
The output from the print() function is correct before exporting the dataframe to a JSON or CSV file. The output is shown below.
col1 col2
0 11111111.84 3
1 123456.55 4
col1 float64
col2 int64
dtype: object
The exported data in CSV format (test.csv) is correct, as it should be.
The exported data in JSON format (test.json) has incorrect decimal digits: the first row of col1 becomes 11111111.8399999999. The issue only occurs for some values, because the second row of col1 is exported correctly (123456.55).
I found a workaround for this issue: specifying the additional argument double_precision for the to_json() function. The result becomes correct! (Already tested.)
Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html
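The call I tested looks roughly like this (double_precision=2 is just an example value here, not a recommendation):

# workaround: limit the number of decimal digits that to_json() writes
df.to_json('/test.json', orient="records", lines=True, double_precision=2)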
However, the double_precision argument applies the same limit to all columns. That is not a good approach when each data column requires a different number of decimal digits.
I also found the topic below, but I am not sure whether it is related to my issue.
Link: What is the difference between NUMERIC and FLOAT in BigQuery?
I am trying to understand the root cause of this issue and to find a solution. It is quite weird that the issue only happens with the to_json() function, while the to_csv() function works fine.
Anyone please help!
pandas' to_json might be doing something weird with the precision there. As you've explained, the canonical solution is to specify double_precision, but that applies one precision globally and doesn't let you selectively round specific columns.
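To make that limitation concrete, here is a sketch with a hypothetical extra column col3 that needs four decimal digits (col3 and the commented output are my illustration, not from your data):

import pandas

# hypothetical second float column that needs 4 decimal digits
df2 = pandas.DataFrame({'col1': [11111111.84], 'col3': [0.1234]})

# double_precision applies to every column, so col3 loses digits too;
# this should print something like: [{"col1":11111111.84,"col3":0.12}]
print(df2.to_json(orient="records", double_precision=2))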
Another option is to cut out the middleman df.to_json here and instead use Python's builtin json.dump:
import json

# convert to a string
json.dumps(df.to_dict())
# '{"col1": {"0": 11111111.84, "1": 123456.55}, "col2": {"0": 3, "1": 4}}'

# save to a file (f is an open file object)
with open('/test.json', 'w') as f:
    json.dump(df.to_dict(), f)
As you can see, this doesn't muck around with the precision. Standard floating point caveats still apply.
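As for the root cause: 11111111.84 has no exact float64 representation. to_json writes floats with a fixed number of digits (double_precision, 10 by default), which exposes the tail of the stored value, while the json module (and to_csv) serialize floats with repr(), the shortest string that round-trips to the same value. And if you do need per-column precision, DataFrame.round accepts a column-to-decimals mapping, so you can round selectively before dumping. A sketch (I haven't run this against your exact setup):

import json
import pandas
from decimal import Decimal

df = pandas.DataFrame({'col1': [11111111.84, 123456.55], 'col2': [3, 4]})

# the exact value stored in the float64 is slightly below 11111111.84,
# which is where to_json's ...8399999999 tail comes from
print(Decimal(11111111.84))

# per-column precision: round col1 to 2 decimals (col2 untouched),
# then dump with the stdlib json module, which uses repr()
rounded = df.round({'col1': 2})
with open('/test.json', 'w') as f:
    json.dump(rounded.to_dict(), f)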