python, pandas, json-normalize

Why is pandas.read_json modifying the values of long integers?


I don't know why the original values of id_1 and id_2 change when I print them.

I have a JSON file named test_data.json:

{
"objects":{
    "value":{
        "1298543947669573634":{
            "timestamp":"Wed Aug 26 08:52:57 +0000 2020",
            "id_1":"1298543947669573634",
            "id_2":"1298519559306190850"
            }
        }
    }
}

Output

python test_data.py 
                  id_1                 id_2                 timestamp
0  1298543947669573632  1298519559306190848 2020-08-26 08:52:57+00:00

My code named test_data.py is

import pandas as pd
import json

file = "test_data.json"
with open(file, "r") as f:
    all_data = json.loads(f.read())
data = pd.read_json(json.dumps(all_data['objects']['value']), orient='index')
data = data.reset_index(drop=True)
print(data.head())

How can I fix this so the numeric values are interpreted correctly?


Solution

    • Using Python 3.8.5 and pandas 1.1.1

    Current Implementation

    • First, the code reads the file and converts it from a str to a dict with json.loads
    with open(file, "r") as f:
        all_data = json.loads(f.read())
    
    • Then 'value' is converted back to a str with json.dumps and passed to pandas.read_json
    pd.read_json(json.dumps(all_data['objects']['value']), orient='index')
    
    • read_json's dtype inference then converts the numeric-looking id strings to numbers. That conversion passes through a 64-bit float, which can only represent integers exactly up to 2**53, so 1298543947669573634 is rounded to the nearest float-representable value, 1298543947669573632. The sketch below shows the rounding and one way to opt out of the inference.
    
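    • A minimal sketch of both the rounding and one possible alternative, the dtype argument of pandas.read_json (dtype=False turns off the inference for the data, convert_axes=False does the same for the index), assuming you are happy keeping the ids as strings:
    import pandas as pd
    
    raw = '{"1298543947669573634": {"id_1": "1298543947669573634", "id_2": "1298519559306190850"}}'
    
    # a float64 holds integers exactly only up to 2**53, so the
    # round-trip through float changes the last digits
    print(int(float("1298543947669573634")))   # 1298543947669573632
    
    # dtype=False asks read_json not to infer dtypes for the data, so
    # id_1 and id_2 stay as the original strings (object dtype);
    # convert_axes=False keeps the index key as a string as well
    df = pd.read_json(raw, orient='index', dtype=False, convert_axes=False)
    print(df.dtypes)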

    Updated code

    Option 1

    • Use pandas.DataFrame.from_dict and then convert to numeric.
    file = "test_data.json"
    with open(file, "r") as f:
        all_data = json.loads(f.read())
    
    # use .from_dict
    data = pd.DataFrame.from_dict(all_data['objects']['value'], orient='index')
    
    # convert columns to numeric
    data[['id_1', 'id_2']] = data[['id_1', 'id_2']].apply(pd.to_numeric, errors='coerce')
    
    data = data.reset_index(drop=True)
    
    # display(data)
                            timestamp                 id_1                 id_2
    0  Wed Aug 26 08:52:57 +0000 2020  1298543947669573634  1298519559306190850
    
    print(data.info())
    [out]:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1 entries, 0 to 0
    Data columns (total 3 columns):
     #   Column     Non-Null Count  Dtype 
    ---  ------     --------------  ----- 
     0   timestamp  1 non-null      object
     1   id_1       1 non-null      int64 
     2   id_2       1 non-null      int64 
    dtypes: int64(2), object(1)
    memory usage: 152.0+ bytes
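
    • Optionally, convert the timestamp column as well, since from_dict leaves it as a plain str (read_json had parsed it into a datetime). A small sketch, assuming every value follows the same Wed Aug 26 08:52:57 +0000 2020 pattern:
    # parse the Twitter-style timestamp into a timezone-aware datetime
    data['timestamp'] = pd.to_datetime(data['timestamp'], format='%a %b %d %H:%M:%S %z %Y')
    
    print(data['timestamp'].dtype)  # now a timezone-aware datetime64 dtype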
    

    Option 2

    • Use pandas.json_normalize and then convert columns to numeric.
    file = "test_data.json"
    with open(file, "r") as f:
        all_data = json.loads(f.read())
    
    # read all_data into a dataframe
    df = pd.json_normalize(all_data['objects']['value'])
    
    # rename the columns
    df.columns = [x.split('.')[1] for x in df.columns]
    
    # convert to numeric
    df[['id_1', 'id_2']] = df[['id_1', 'id_2']].apply(pd.to_numeric, errors='coerce')
    
    # display(df)
                            timestamp                 id_1                 id_2
    0  Wed Aug 26 08:52:57 +0000 2020  1298543947669573634  1298519559306190850
    
    print(df.info())
    [out]:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1 entries, 0 to 0
    Data columns (total 3 columns):
     #   Column     Non-Null Count  Dtype 
    ---  ------     --------------  ----- 
     0   timestamp  1 non-null      object
     1   id_1       1 non-null      int64 
     2   id_2       1 non-null      int64 
    dtypes: int64(2), object(1)
    memory usage: 152.0+ bytes
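
    • Note that the column-renaming line in Option 2 assumes there is a single record under 'value'. With several records, json_normalize flattens the whole dict into one wide row of dotted columns, while DataFrame.from_dict(orient='index') keeps one row per record. A small sketch with a hypothetical second record:
    import pandas as pd
    
    value = {
        "1298543947669573634": {"id_1": "1298543947669573634"},
        "1298519559306190850": {"id_1": "1298519559306190850"},
    }
    
    # json_normalize flattens the nested dict into a single wide row
    print(pd.json_normalize(value).shape)                       # (1, 2)
    
    # from_dict(orient='index') gives one row per top-level key
    print(pd.DataFrame.from_dict(value, orient='index').shape)  # (2, 1)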