I have a data structure which is a very complex/nested JSON.
After reading other solutions on SO, I'm comfortable with the fact that I may need to make multiple DFs and merge them together (e.g. json_normalize with multiple record paths).
{
  "topLevelIdentifier": "a86507",
  "activities": [
    {
      "activityId": "1f13d848",
      "taskList": [
        {
          "taskIdentifier": "c6ba-1012",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "3438"
        },
        {
          "taskIdentifier": "cc48-07bf",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "4b32"
        },
        {
          "taskIdentifier": "35b896",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "7bcca"
        }
      ],
      "sectionB": {
        "someCapacityA": {
          "totalVolume": {
            "value": 8415,
            "unit": "CU_CM"
          },
          "totalWeight": {
            "value": 1059.0,
            "unit": "GRAMS"
          }
        },
        "someCapacityB": {
          "totalVolume": {
            "value": 0.0,
            "unit": "CU_CM"
          },
          "totalWeight": {
            "value": 0.0,
            "unit": "GRAMS"
          }
        }
      },
      "sectionC": {....},
      "sectionD": {....},
      "sectionE": {...}
    }
  ]
}
I can create a base for the merges by doing this:
import json
import pandas as pd

with open('my_file.json') as simulation_file:
    simulation_data = json.load(simulation_file)

df_base = pd.json_normalize(simulation_data, 'activities', ['topLevelIdentifier'])
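For reference, df_base comes out as one row per activity: the nested sectionB dicts get flattened into dotted columns automatically, and it's only the taskList list that survives unflattened (exact column order may vary by pandas version):

df_base.columns.tolist()
# ['activityId', 'taskList', 'sectionC', 'sectionD', 'sectionE',
#  'sectionB.someCapacityA.totalVolume.value',
#  'sectionB.someCapacityA.totalVolume.unit',
#  ...
#  'sectionB.someCapacityB.totalWeight.unit',
#  'topLevelIdentifier']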
Then normalizing the next part isn't too bad:
df_taskList = pd.json_normalize(simulation_data['activities'],
                                'taskList', ['activityId'])
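Merging the flattened pieces back together should then just be a column join on the shared key, e.g. something like:

df_merged = df_base.drop(columns='taskList').merge(df_taskList, on='activityId')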
It's normalizing the next section that has me stuck. Here's what I'm trying, but it throws an error:
df_sectionB = pd.json_normalize(simulation_data['activities'],
'activityId',['taskList',['taskIdentifier',['taskTime',['unit', 'value']], 'taskLocationIdentifier']])
I'm basically trying to flatten out sectionB such that it has the activityId, so that I can merge all the pieces together.
If there's a simpler way to flatten out to the deepest layer in a complex format like this, I'm open to alternative approaches!
You could add .explode("taskList") to the result of the first normalize. You can then normalize the taskList column, copy the base index, and concat the results.
# one row per activity, then explode taskList so each task gets its own row
df_base = pd.json_normalize(
    simulation_data,
    record_path="activities",
    meta="topLevelIdentifier"
).explode("taskList")

# flatten the exploded taskList dicts, re-align them on df_base's index,
# then glue the two frames together column-wise
df = pd.concat(
    [
        df_base.drop(columns="taskList"),
        pd.json_normalize(df_base["taskList"]).set_index(df_base.index),
    ],
    axis=1,
)
>>> df.T
                                                   0           0           0
activityId                                  1f13d848    1f13d848    1f13d848
sectionC                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
sectionD                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
sectionE                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
sectionB.someCapacityA.totalVolume.value        8415        8415        8415
sectionB.someCapacityA.totalVolume.unit        CU_CM       CU_CM       CU_CM
sectionB.someCapacityA.totalWeight.value      1059.0      1059.0      1059.0
sectionB.someCapacityA.totalWeight.unit        GRAMS       GRAMS       GRAMS
sectionB.someCapacityB.totalVolume.value         0.0         0.0         0.0
sectionB.someCapacityB.totalVolume.unit        CU_CM       CU_CM       CU_CM
sectionB.someCapacityB.totalWeight.value         0.0         0.0         0.0
sectionB.someCapacityB.totalWeight.unit        GRAMS       GRAMS       GRAMS
topLevelIdentifier                            a86507      a86507      a86507
taskIdentifier                             c6ba-1012   cc48-07bf      35b896
taskLocationIdentifier                          3438        4b32       7bcca
taskTime.value                                     0           0           0
taskTime.unit                                SECONDS     SECONDS     SECONDS
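One thing to note: explode keeps the original index, which is why all three task rows above print under the same label 0. If you want unique row labels after the concat, reset the index at the end:

df = df.reset_index(drop=True)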