
Pandas 2.0.3? Problems keeping format when file is saved in .json or .csv format


Here is a minimal example that reproduces the problem:

# create df
import pandas as pd
df2 = pd.DataFrame({'var1':['1_0','1_0','1_0','1_0','1_0'],
                   'var2':['X','y','a','a','a']})
df2.to_json('df2.json')

# import df
df2 = pd.read_json('df2.json')
df2

This is the expected output:

    var1    var2
0   1_0      X
1   1_0      y
2   1_0      a
3   1_0      a
4   1_0      a

However, it generates:

    var1    var2
0   10      X
1   10      y
2   10      a
3   10      a
4   10      a

If I replace an entry inside ['var1'] with a string that cannot be read as a number, then the DataFrame is correct when it is imported again.

Here is an example to illustrate it, where I replaced one of the entries with 'hello':

df2 = pd.DataFrame({'var1':['1_0','hello','1_0','1_0','1_0'],
                   'var2':['X','y','a','a','a']})
df2.to_json('df2.json')

# import df
df2 = pd.read_json('df2.json')
df2

This generates:

    var1    var2
0   1_0     X
1   hello   y
2   1_0     a
3   1_0     a
4   1_0     a

The same problem is observed if the file is saved in CSV format and then imported.
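
A minimal sketch of the CSV round trip (the file name df2.csv is just illustrative), using the same df2 as above:

# save and re-import as CSV -- var1 again comes back as the number 10
df2.to_csv('df2.csv', index=False)
df2 = pd.read_csv('df2.csv')
df2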

Has anyone encountered the same issue?


Solution

  • This is because underscores are valid digit separators in Python (often used as thousands separators: 1_000 is 1000), so '1_0' is parsed as the number 10 when the file is read back. You can force the dtype upon import (or disable dtype inference entirely with dtype=False):

    df2 = pd.read_json('df2.json', dtype='str')
    

    If you want to keep dtype detection for the other columns:

    df2 = pd.read_json('df2.json', dtype={'var1': 'str'})
    

    Output:

      var1 var2
    0  1_0    X
    1  1_0    y
    2  1_0    a
    3  1_0    a
    4  1_0    a
    

    When the column contains a string that cannot be parsed as a number (such as 'hello'), there is no ambiguity: the values are kept as strings and no conversion is done.
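
    A quick way to see the underscore parsing at work, plus the same kind of dtype override for the CSV round trip mentioned in the question (a sketch, assuming the file was written as 'df2.csv'):

    # Python accepts underscores as digit separators in numeric strings
    int('1_0')      # -> 10

    # force var1 to stay a string when reading the CSV back in
    df2 = pd.read_csv('df2.csv', dtype={'var1': 'str'})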