Search code examples
pythonpandasdataframenormalize

Normalize JSON data to Pandas DataFrame where columns and values are in lists


I have the following complex JSON data that needs to be normalised to a Pandas DataFrame.

{
    "notice": {
        "copyright": "Copyright Info",
        "copyright_url": "This_is_URL"
    },
    "product_info": {
        "refresh message": "Issued at 09:36 UTC Saturday 01/10/22 October 2022",
        "issuance_time": "20221001T0936Z",
        "product_name": "24hr historic data",
        "ID": "XXXXYYYYZZZZ"
    },
    "data": [
        {
            "station_info": {
                "wmo_id": 95214,
                "station_name": "STATION 1",
                "station_height": 3.8,
                "station_type": "AWS"
            },
            "observation_data": {
                "columns": [
                    "temp",
                    "dewt",
                    "rh",
                    "wnd_spd_kmh"
                ],
                "data": [
                    [
                        27.2,
                        23.7,
                        81.0,
                        26.0
                    ],
                    [
                        27.3,
                        23.5,
                        80.0,
                        28.0
                    ],
                    [
                        27.4,
                        23.2,
                        78.0,
                        30.0
                    ]
                ]
            }
        },
        {
            "station_info": {
                "wmo_id": 95215,
                "station_name": "STATION 2",
                "station_height": 3.5,
                "station_type": "AWS"
            },
            "observation_data": {
                "columns": [
                    "temp",
                    "dewt",
                    "rh",
                    "wnd_spd_kmh"
                ],
                "data": [
                    [
                        24.2,
                        25.7,
                        82.1,
                        21.0
                    ],
                    [
                        24.3,
                        25.8,
                        79.6,
                        22.0
                    ],
                    [
                        24.4,
                        25.9,
                        78.3,
                        16.0
                    ]
                ]
            }
        }
    ]
}

The columns of the expected DataFrame is in a list in the JSON as "columns". So are the actual "data".

What is expected to output is as below:

enter image description here

What I have attempted:

with gzip.open(json_file_path, "r") as f:
        data = f.read()
        j = json.loads (data.decode('utf-8'))
national_data = pd.json_normalize(j['data'])

However, the whole "columns" list was converted to a cell value.

        station_info.wmo_id    station_info.station_name         observation_data.columns         observation_data.data
    0   95214                  STATION 1                         [temp, dewt, rh, wnd_spd_kmh]     [[27.2, 23.7, 81.0, 26.0],[...],[...]]
    1   95215                  STATION 2                         [temp, dewt, rh, wnd_spd_kmh]     [[24.2, 25.7, 82.1, 21.0],[...],[...]]

Solution

  • I don't think json_normalize alone will be able to achieve the output you wanted. You can achieve this by adding explode (to expand the arrays generated by json_normalize) and pivot (to transform the dataframe from long to wide format).

    First step is to flatten the observation_data.data and reset the index (using explode and reset_index) to be able to know or organize group of records as one observation. From here we can again perform another explode (including both the columns and data) as shown below.

    Last step now is to pivot the resulting dataframe using the index of the first explode as our unique anchor to transform from long to wide format.

    >>> data_exploded = national_data.explode('observation_data.data').reset_index(drop=True).explode(['observation_data.columns','observation_data.data']).reset_index()
    >>> data_exploded
        index  station_info.wmo_id station_info.station_name observation_data.columns observation_data.data
    0       0                95214                 STATION 1                     temp                  27.2
    1       0                95214                 STATION 1                     dewt                  23.7
    2       0                95214                 STATION 1                       rh                  81.0
    3       0                95214                 STATION 1              wnd_spd_kmh                  26.0
    4       1                95214                 STATION 1                     temp                  27.3
    5       1                95214                 STATION 1                     dewt                  23.5
    6       1                95214                 STATION 1                       rh                  80.0
    7       1                95214                 STATION 1              wnd_spd_kmh                  28.0
    8       2                95214                 STATION 1                     temp                  27.4
    9       2                95214                 STATION 1                     dewt                  23.2
    10      2                95214                 STATION 1                       rh                  78.0
    11      2                95214                 STATION 1              wnd_spd_kmh                  30.0
    12      3                95215                 STATION 2                     temp                  24.2
    13      3                95215                 STATION 2                     dewt                  25.7
    14      3                95215                 STATION 2                       rh                  82.1
    15      3                95215                 STATION 2              wnd_spd_kmh                  21.0
    16      4                95215                 STATION 2                     temp                  24.3
    17      4                95215                 STATION 2                     dewt                  25.8
    18      4                95215                 STATION 2                       rh                  79.6
    19      4                95215                 STATION 2              wnd_spd_kmh                  22.0
    20      5                95215                 STATION 2                     temp                  24.4
    21      5                95215                 STATION 2                     dewt                  25.9
    22      5                95215                 STATION 2                       rh                  78.3
    23      5                95215                 STATION 2              wnd_spd_kmh                  16.0
    >>> data_exploded.pivot(index=['index', 'station_info.wmo_id', 'station_info.station_name'],columns='observation_data.columns',values='observation_data.data').reset_index().drop(columns=['index'])
    observation_data.columns  station_info.wmo_id station_info.station_name  dewt    rh  temp wnd_spd_kmh
    0                                       95214                 STATION 1  23.7  81.0  27.2        26.0
    1                                       95214                 STATION 1  23.5  80.0  27.3        28.0
    2                                       95214                 STATION 1  23.2  78.0  27.4        30.0
    3                                       95215                 STATION 2  25.7  82.1  24.2        21.0
    4                                       95215                 STATION 2  25.8  79.6  24.3        22.0
    5                                       95215                 STATION 2  25.9  78.3  24.4        16.0