I have the following complex JSON data that needs to be normalised to a Pandas DataFrame.
{
  "notice": {
    "copyright": "Copyright Info",
    "copyright_url": "This_is_URL"
  },
  "product_info": {
    "refresh message": "Issued at 09:36 UTC Saturday 01/10/22 October 2022",
    "issuance_time": "20221001T0936Z",
    "product_name": "24hr historic data",
    "ID": "XXXXYYYYZZZZ"
  },
  "data": [
    {
      "station_info": {
        "wmo_id": 95214,
        "station_name": "STATION 1",
        "station_height": 3.8,
        "station_type": "AWS"
      },
      "observation_data": {
        "columns": ["temp", "dewt", "rh", "wnd_spd_kmh"],
        "data": [
          [27.2, 23.7, 81.0, 26.0],
          [27.3, 23.5, 80.0, 28.0],
          [27.4, 23.2, 78.0, 30.0]
        ]
      }
    },
    {
      "station_info": {
        "wmo_id": 95215,
        "station_name": "STATION 2",
        "station_height": 3.5,
        "station_type": "AWS"
      },
      "observation_data": {
        "columns": ["temp", "dewt", "rh", "wnd_spd_kmh"],
        "data": [
          [24.2, 25.7, 82.1, 21.0],
          [24.3, 25.8, 79.6, 22.0],
          [24.4, 25.9, 78.3, 16.0]
        ]
      }
    }
  ]
}
The columns of the expected DataFrame are given in the JSON as the "columns" list, and the actual values as the "data" list of rows.
The expected output is a wide DataFrame with one row per observation: the station info alongside one column for each entry in "columns" (temp, dewt, rh, wnd_spd_kmh).
What I have attempted:
import gzip
import json
import pandas as pd

with gzip.open(json_file_path, "r") as f:
    data = f.read()
j = json.loads(data.decode('utf-8'))
national_data = pd.json_normalize(j['data'])
However, the whole "columns" list was kept as a single cell value, and so was "data":
station_info.wmo_id station_info.station_name observation_data.columns observation_data.data
0 95214 STATION 1 [temp, dewt, rh, wnd_spd_kmh] [[27.2, 23.7, 81.0, 26.0],[...],[...]]
1 95215 STATION 2 [temp, dewt, rh, wnd_spd_kmh] [[24.2, 25.7, 82.1, 21.0],[...],[...]]
I don't think json_normalize alone can produce the output you want. You can get there by following it with explode (to expand the lists that json_normalize leaves in the cells) and pivot (to transform the DataFrame from long to wide format).
The first step is to explode observation_data.data and reset the index, so that each observation (each inner list of readings) gets its own row and a unique index. From there, a second explode over both observation_data.columns and observation_data.data pairs each column name with its value, as shown below.
The last step is to pivot the resulting DataFrame, using the index produced by the first explode as the unique anchor, to go from long to wide format.
>>> data_exploded = national_data.explode('observation_data.data').reset_index(drop=True).explode(['observation_data.columns','observation_data.data']).reset_index()
>>> data_exploded
index station_info.wmo_id station_info.station_name observation_data.columns observation_data.data
0 0 95214 STATION 1 temp 27.2
1 0 95214 STATION 1 dewt 23.7
2 0 95214 STATION 1 rh 81.0
3 0 95214 STATION 1 wnd_spd_kmh 26.0
4 1 95214 STATION 1 temp 27.3
5 1 95214 STATION 1 dewt 23.5
6 1 95214 STATION 1 rh 80.0
7 1 95214 STATION 1 wnd_spd_kmh 28.0
8 2 95214 STATION 1 temp 27.4
9 2 95214 STATION 1 dewt 23.2
10 2 95214 STATION 1 rh 78.0
11 2 95214 STATION 1 wnd_spd_kmh 30.0
12 3 95215 STATION 2 temp 24.2
13 3 95215 STATION 2 dewt 25.7
14 3 95215 STATION 2 rh 82.1
15 3 95215 STATION 2 wnd_spd_kmh 21.0
16 4 95215 STATION 2 temp 24.3
17 4 95215 STATION 2 dewt 25.8
18 4 95215 STATION 2 rh 79.6
19 4 95215 STATION 2 wnd_spd_kmh 22.0
20 5 95215 STATION 2 temp 24.4
21 5 95215 STATION 2 dewt 25.9
22 5 95215 STATION 2 rh 78.3
23 5 95215 STATION 2 wnd_spd_kmh 16.0
>>> data_exploded.pivot(index=['index', 'station_info.wmo_id', 'station_info.station_name'],columns='observation_data.columns',values='observation_data.data').reset_index().drop(columns=['index'])
observation_data.columns station_info.wmo_id station_info.station_name dewt rh temp wnd_spd_kmh
0 95214 STATION 1 23.7 81.0 27.2 26.0
1 95214 STATION 1 23.5 80.0 27.3 28.0
2 95214 STATION 1 23.2 78.0 27.4 30.0
3 95215 STATION 2 25.7 82.1 24.2 21.0
4 95215 STATION 2 25.8 79.6 24.3 22.0
5 95215 STATION 2 25.9 78.3 24.4 16.0
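For reference, here is the whole pipeline as one self-contained script. This is just a sketch: it assumes the gzipped file at json_file_path (your variable) contains exactly the structure shown above, and the multi-column explode needs pandas 1.3 or later.

import gzip
import json

import pandas as pd

# Load the gzipped JSON (json_file_path is assumed to be defined, as in the question)
with gzip.open(json_file_path, "r") as f:
    j = json.loads(f.read().decode("utf-8"))

# Flatten the nested dicts; the "columns"/"data" lists remain as cell values
national_data = pd.json_normalize(j["data"])

# 1) one row per observation (each inner list of readings)
# 2) pair each column name with its value (multi-column explode, pandas >= 1.3)
data_exploded = (
    national_data
    .explode("observation_data.data")
    .reset_index(drop=True)
    .explode(["observation_data.columns", "observation_data.data"])
    .reset_index()
)

# Long -> wide: one column per measurement, anchored on the per-observation index
result = (
    data_exploded
    .pivot(
        index=["index", "station_info.wmo_id", "station_info.station_name"],
        columns="observation_data.columns",
        values="observation_data.data",
    )
    .reset_index()
    .drop(columns=["index"])
)
print(result)

If you also want station_info.station_height and station_info.station_type in the result, add them to the pivot's index list; they are constant per station, so they will be carried through unchanged.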