Tags: json, pandas, json-normalize

Pandas | JSON Normalize | Deeply Nested Sections


I have a data structure which is a very complex/nested JSON.

After reading other solutions on SO, I am comfortable with the fact that I may need to make multiple DFs and merge them together (e.g. json_normalize with multiple record paths).

{
  "topLevelIdentifier": "a86507",
  "activities": [
    {
      "activityId": "1f13d848",
      "taskList": [
        {
          "taskIdentifier": "c6ba-1012",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "3438"
        },
        {
          "taskIdentifier": "cc48-07bf",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "4b32"
        },
        {
          "taskIdentifier": "35b896",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "7bcca"
        }
      ],
      "sectionB": {
        "someCapacityA": {
          "totalVolume": {
            "value": 8415,
            "unit": "CU_CM"
          },
          "totalWeight": {
            "value": 1059.0,
            "unit": "GRAMS"
          }
        },
        "someCapacityB": {
          "totalVolume": {
            "value": 0.0,
            "unit": "CU_CM"
          },
          "totalWeight": {
            "value": 0.0,
            "unit": "GRAMS"
          }
        }
      },
      "sectionC": {...},
      "sectionD": {...},
      "sectionE": {...}
    }
  ]
}

I can create a base for the merges by doing this:

import json
import pandas as pd

with open('my_file.json') as simulation_file:
    simulation_data = json.load(simulation_file)

df_base = pd.json_normalize(simulation_data, 'activities', ['topLevelIdentifier'])

Then normalizing the next part isn't too bad:

df_taskList = pd.json_normalize(simulation_data['activities'],
                                'taskList', ['activityId'])
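
With those two frames in hand, the merge step I mentioned above is straightforward. A sketch, using a trimmed inline copy of the sample data in place of my_file.json (assumption: the real file has the same shape):

```python
import pandas as pd

# Trimmed stand-in for the real JSON file (same shape as the sample above)
simulation_data = {
    "topLevelIdentifier": "a86507",
    "activities": [
        {
            "activityId": "1f13d848",
            "taskList": [
                {"taskIdentifier": "c6ba-1012",
                 "taskTime": {"value": 0, "unit": "SECONDS"},
                 "taskLocationIdentifier": "3438"},
                {"taskIdentifier": "cc48-07bf",
                 "taskTime": {"value": 0, "unit": "SECONDS"},
                 "taskLocationIdentifier": "4b32"},
            ],
        }
    ],
}

df_base = pd.json_normalize(simulation_data, 'activities', ['topLevelIdentifier'])
df_taskList = pd.json_normalize(simulation_data['activities'],
                                'taskList', ['activityId'])

# One row per task, each carrying its activity's top-level metadata;
# drop the raw taskList column so it isn't duplicated in the result
df_merged = df_taskList.merge(df_base.drop(columns='taskList'), on='activityId')
```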

It's normalizing the next section that has me stuck. Here's what I'm trying, but it throws an error:

df_sectionB = pd.json_normalize(simulation_data['activities'],
                                'activityId',
                                ['taskList', ['taskIdentifier', ['taskTime', ['unit', 'value']], 'taskLocationIdentifier']])

I'm basically trying to flatten out sectionB such that it carries the activityId, so that I can merge all the pieces together.
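
One detail worth noting: sectionB is a dict, not a list, so it needs no record_path at all. json_normalize flattens nested dicts on its own. A minimal sketch with an inline stand-in for the data (assumption: same shape as the sample above):

```python
import pandas as pd

# Minimal stand-in with the same shape as the sample data
simulation_data = {"activities": [{
    "activityId": "1f13d848",
    "sectionB": {"someCapacityA": {"totalVolume": {"value": 8415, "unit": "CU_CM"}}},
}]}

# No record_path: nested dicts like sectionB are flattened automatically
df_sectionB = pd.json_normalize(simulation_data["activities"])

# Keep only the activity key plus the sectionB columns for the later merge
cols = ["activityId"] + [c for c in df_sectionB.columns if c.startswith("sectionB.")]
df_sectionB = df_sectionB[cols]
```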

If there's a simpler way to flatten out to the deepest layer in a complex format like this, I'm open to alternative approaches!


Solution

  • You could add .explode("taskList") to the result of the first normalize.

    You can then normalize the taskList column, copy the base index, and concat the results.

    df_base = pd.json_normalize(
        simulation_data,
        record_path="activities",
        meta="topLevelIdentifier",
    ).explode("taskList")
    
    df = pd.concat(
        [
            df_base.drop(columns="taskList"),
            pd.json_normalize(df_base["taskList"]).set_index(df_base.index),
        ],
        axis=1,
    )
    
    >>> df.T
                                                       0           0           0
    activityId                                  1f13d848    1f13d848    1f13d848
    sectionC                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
    sectionD                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
    sectionE                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
    sectionB.someCapacityA.totalVolume.value        8415        8415        8415
    sectionB.someCapacityA.totalVolume.unit        CU_CM       CU_CM       CU_CM
    sectionB.someCapacityA.totalWeight.value      1059.0      1059.0      1059.0
    sectionB.someCapacityA.totalWeight.unit        GRAMS       GRAMS       GRAMS
    sectionB.someCapacityB.totalVolume.value         0.0         0.0         0.0
    sectionB.someCapacityB.totalVolume.unit        CU_CM       CU_CM       CU_CM
    sectionB.someCapacityB.totalWeight.value         0.0         0.0         0.0
    sectionB.someCapacityB.totalWeight.unit        GRAMS       GRAMS       GRAMS
    topLevelIdentifier                            a86507      a86507      a86507
    taskIdentifier                             c6ba-1012   cc48-07bf      35b896
    taskLocationIdentifier                          3438        4b32       7bcca
    taskTime.value                                     0           0           0
    taskTime.unit                                SECONDS     SECONDS     SECONDS
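
If the same explode/concat pattern has to be repeated for several nested layers, it can be generalized. The helper below is my own sketch (not from the answer above): it loops, exploding list columns and expanding dict columns, until every cell is a scalar. It assumes those columns contain no missing values, since `json_normalize` cannot expand NaN cells.

```python
import pandas as pd

def fully_flatten(df):
    """Repeatedly explode list columns and expand dict columns until
    every cell is a scalar. Assumes no NaN cells in nested columns."""
    while True:
        # Columns still holding lists (to explode) or dicts (to expand)
        list_cols = [c for c in df.columns
                     if df[c].map(lambda v: isinstance(v, list)).any()]
        dict_cols = [c for c in df.columns
                     if df[c].map(lambda v: isinstance(v, dict)).any()]
        if not list_cols and not dict_cols:
            return df.reset_index(drop=True)
        for col in list_cols:
            df = df.explode(col)
        for col in dict_cols:
            # Expand the dict column, prefixing new columns with its name
            expanded = pd.json_normalize(df[col].tolist())
            expanded.index = df.index
            expanded.columns = [f"{col}.{c}" for c in expanded.columns]
            df = pd.concat([df.drop(columns=col), expanded], axis=1)
```

Calling `fully_flatten(pd.json_normalize(simulation_data))` then yields one fully scalar row per task, with dotted column names such as `activities.taskList.taskTime.value`.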