Search code examples
pythonjsonpandasapinormalize

how to convert json to a dataframe using json_normalize?


I'm trying to covert an api response from json to a dataframe in pandas. the problem I am having is that de data is nested in the json format and I am not getting the right columns in my dataframe.

The data is collect from a api with the following format:

{
    "data": [
        {
            "timestamp": "2019-04-10T11:40:13.437Z",
            "score": 87,
            "sensors": [
                {
                    "comp": "temp",
                    "value": 20.010000228881836
                },
                {
                    "comp": "humid",
                    "value": 34.4900016784668
                },
                {
                    "comp": "co2",
                    "value": 418
                },
                {
                    "comp": "voc",
                    "value": 166
                },
                {
                    "comp": "pm25",
                    "value": 4
                },
                {
                    "comp": "lux",
                    "value": 961.4000244140625
                },
                {
                    "comp": "spl_a",
                    "value": 45.70000076293945
                }
            ],
            "indices": [
                {
                    "comp": "temp",
                    "value": -1
                },
                {
                    "comp": "humid",
                    "value": -2
                },
                {
                    "comp": "co2",
                    "value": 0
                },
                {
                    "comp": "voc",
                    "value": 0
                },
                {
                    "comp": "pm25",
                    "value": 0
                }
            ]
        },
        {
            "timestamp": "2019-04-10T11:40:03.413Z",
            "score": 87,
            "sensors": [
                {
                    "comp": "temp",
                    "value": 20.040000915527344
                },
                {
                    "comp": "humid",
                    "value": 34.630001068115234
                },
                {
                    "comp": "co2",
                    "value": 418
                },
                {
                    "comp": "voc",
                    "value": 169
                },
                {
                    "comp": "pm25",
                    "value": 5
                },
                {
                    "comp": "lux",
                    "value": 960.2000122070312
                },
                {
                    "comp": "spl_a",
                    "value": 46
                }
            ],
            "indices": [
                {
                    "comp": "temp",
                    "value": -1
                },
                {
                    "comp": "humid",
                    "value": -1
                },
                {
                    "comp": "co2",
                    "value": 0
                },
                {
                    "comp": "voc",
                    "value": 0
                },
                {
                    "comp": "pm25",
                    "value": 0
                }
            ]
        },

edit as you can see it are more sets of data

What I already tried is the following: I have converted the JSON format into a dict and then I normalized it with the normalize function. see the code as following:

data = r.json() 

works_data = json_normalize(data=data['data'], record_path=['sensors'],meta=['timestamp'])

df = pd.DataFrame.from_dict(works_data)
print(df)

the result I am getting is:

 comp        value                 timestamp
0      temp    21.059999  2019-04-10T12:39:05.062Z
1     humid    31.250000  2019-04-10T12:39:05.062Z
2       co2   407.000000  2019-04-10T12:39:05.062Z
3       voc   136.000000  2019-04-10T12:39:05.062Z
4      pm25     3.000000  2019-04-10T12:39:05.062Z
5       lux  1302.099976  2019-04-10T12:39:05.062Z
6     spl_a    46.299999  2019-04-10T12:39:05.062Z

the result I need is as following: the result

is there somebody who can help me?


Solution

  • You can reshape your works_data:

    data = {
        "data": [
            {
                "timestamp": "2019-04-10T11:40:13.437Z",
                "score": 87,
                "sensors": [
                    {
                        "comp": "temp",
                        "value": 20.010000228881836
                    },
                    {
                        "comp": "humid",
                        "value": 34.4900016784668
                    },
                    {
                        "comp": "co2",
                        "value": 418
                    },
                    {
                        "comp": "voc",
                        "value": 166
                    },
                    {
                        "comp": "pm25",
                        "value": 4
                    },
                    {
                        "comp": "lux",
                        "value": 961.4000244140625
                    },
                    {
                        "comp": "spl_a",
                        "value": 45.70000076293945
                    }
                ],
                "indices": [
                    {
                        "comp": "temp",
                        "value": -1
                    },
                    {
                        "comp": "humid",
                        "value": -2
                    },
                    {
                        "comp": "co2",
                        "value": 0
                    },
                    {
                        "comp": "voc",
                        "value": 0
                    },
                    {
                        "comp": "pm25",
                        "value": 0
                    }
                ]
            },
            {
                "timestamp": "2019-04-10T11:40:03.413Z",
                "score": 87,
                "sensors": [
                    {
                        "comp": "temp",
                        "value": 20.040000915527344
                    },
                    {
                        "comp": "humid",
                        "value": 34.630001068115234
                    },
                    {
                        "comp": "co2",
                        "value": 418
                    },
                    {
                        "comp": "voc",
                        "value": 169
                    },
                    {
                        "comp": "pm25",
                        "value": 5
                    },
                    {
                        "comp": "lux",
                        "value": 960.2000122070312
                    },
                    {
                        "comp": "spl_a",
                        "value": 46
                    }
                ],
                "indices": [
                    {
                        "comp": "temp",
                        "value": -1
                    },
                    {
                        "comp": "humid",
                        "value": -1
                    },
                    {
                        "comp": "co2",
                        "value": 0
                    },
                    {
                        "comp": "voc",
                        "value": 0
                    },
                    {
                        "comp": "pm25",
                        "value": 0
                    }
                ]
            }]}
    
    
    from pandas.io.json import json_normalize
    import pandas as pd        
    
    df = pd.DataFrame()
    for each in data['data']:
        timestamp = each['timestamp']
        temp_df = json_normalize(data=each, record_path=['sensors']).T
    
        columns = list(temp_df.iloc[0])
        data_values = list(temp_df.iloc[1,:])
    
        temp_df = pd.DataFrame([data_values + [timestamp]], columns=columns + ['timestamp'])
    
        df = df.append(temp_df).reset_index(drop=True)
    
    
    
    print(df)
    

    Output:

    print(df)
            temp      humid    co2  ...         lux      spl_a                 timestamp
    0  20.010000  34.490002  418.0  ...  961.400024  45.700001  2019-04-10T11:40:13.437Z
    1  20.040001  34.630001  418.0  ...  960.200012  46.000000  2019-04-10T11:40:03.413Z
    
    [2 rows x 8 columns]