Tags: json, pandas, json-normalize

Pandas | JSON Normalize | Deeply Nested Sections


I have a data structure which is a very complex/nested JSON.

After reading other solutions on SO, I am comfortable with the fact that I may need to make multiple DFs and merge them together (e.g. json_normalize with multiple record paths).

{
  "topLevelIdentifier": "a86507",
  "activities": [
    {
      "activityId": "1f13d848",
      "taskList": [
        {
          "taskIdentifier": "c6ba-1012",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "3438"
        },
        {
          "taskIdentifier": "cc48-07bf",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "4b32"
        },
        {
          "taskIdentifier": "35b896",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "7bcca"
        }
      ],
      "sectionB": {
        "someCapacityA": {
          "totalVolume": {
            "value": 8415,
            "unit": "CU_CM"
          },
          "totalWeight": {
            "value": 1059.0,
            "unit": "GRAMS"
          }
        },
        "someCapacityB": {
          "totalVolume": {
            "value": 0.0,
            "unit": "CU_CM"
          },
          "totalWeight": {
            "value": 0.0,
            "unit": "GRAMS"
          }
        }
      },
      "sectionC": {...},
      "sectionD": {...},
      "sectionE": {...}
    }
  ]
}

I can create a base for the merges by doing this:

import json
import pandas as pd

with open('my_file.json') as simulation_file:
    simulation_data = json.load(simulation_file)

df_base = pd.json_normalize(simulation_data, 'activities', ['topLevelIdentifier'])

Then normalizing the next part isn't too bad:

df_taskList = pd.json_normalize(simulation_data['activities'],
                                'taskList', ['activityId'])
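
With those two frames in hand, the merge step I mentioned above is straightforward. A sketch, using a trimmed inline copy of the sample data in place of my_file.json (assumption: the real file has the same shape):

```python
import pandas as pd

# Trimmed stand-in for the real JSON file (same shape as the sample above)
simulation_data = {
    "topLevelIdentifier": "a86507",
    "activities": [
        {
            "activityId": "1f13d848",
            "taskList": [
                {"taskIdentifier": "c6ba-1012",
                 "taskTime": {"value": 0, "unit": "SECONDS"},
                 "taskLocationIdentifier": "3438"},
                {"taskIdentifier": "cc48-07bf",
                 "taskTime": {"value": 0, "unit": "SECONDS"},
                 "taskLocationIdentifier": "4b32"},
            ],
        }
    ],
}

df_base = pd.json_normalize(simulation_data, 'activities', ['topLevelIdentifier'])
df_taskList = pd.json_normalize(simulation_data['activities'],
                                'taskList', ['activityId'])

# One row per task, each carrying its activity's top-level metadata;
# drop the raw taskList column so it isn't duplicated in the result
df_merged = df_taskList.merge(df_base.drop(columns='taskList'), on='activityId')
```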

It's normalizing the next section that has me stuck. Here's what I'm trying, but it throws an error:

df_sectionB = pd.json_normalize(simulation_data['activities'],
                                'activityId',
                                ['taskList', ['taskIdentifier', ['taskTime', ['unit', 'value']], 'taskLocationIdentifier']])

I'm basically trying to flatten out sectionB such that it carries the activityId, so that I can merge all the pieces together.
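
One detail worth noting: sectionB is a dict, not a list, so it needs no record_path at all. json_normalize flattens nested dicts on its own. A minimal sketch with an inline stand-in for the data (assumption: same shape as the sample above):

```python
import pandas as pd

# Minimal stand-in with the same shape as the sample data
simulation_data = {"activities": [{
    "activityId": "1f13d848",
    "sectionB": {"someCapacityA": {"totalVolume": {"value": 8415, "unit": "CU_CM"}}},
}]}

# No record_path: nested dicts like sectionB are flattened automatically
df_sectionB = pd.json_normalize(simulation_data["activities"])

# Keep only the activity key plus the sectionB columns for the later merge
cols = ["activityId"] + [c for c in df_sectionB.columns if c.startswith("sectionB.")]
df_sectionB = df_sectionB[cols]
```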

If there's a simpler way to flatten out to the deepest layer in a complex format like this, I'm open to alternative approaches!


Solution

  • You could add .explode("taskList") to the result of the first normalize.

    You can then normalize the taskList column, copy the base index, and concat the results.

    df_base = pd.json_normalize(
        simulation_data,
        record_path="activities",
        meta="topLevelIdentifier",
    ).explode("taskList")
    
    df = pd.concat(
        [
            df_base.drop(columns="taskList"),
            pd.json_normalize(df_base["taskList"]).set_index(df_base.index),
        ],
        axis=1,
    )
    
    >>> df.T
                                                       0           0           0
    activityId                                  1f13d848    1f13d848    1f13d848
    sectionC                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
    sectionD                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
    sectionE                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
    sectionB.someCapacityA.totalVolume.value        8415        8415        8415
    sectionB.someCapacityA.totalVolume.unit        CU_CM       CU_CM       CU_CM
    sectionB.someCapacityA.totalWeight.value      1059.0      1059.0      1059.0
    sectionB.someCapacityA.totalWeight.unit        GRAMS       GRAMS       GRAMS
    sectionB.someCapacityB.totalVolume.value         0.0         0.0         0.0
    sectionB.someCapacityB.totalVolume.unit        CU_CM       CU_CM       CU_CM
    sectionB.someCapacityB.totalWeight.value         0.0         0.0         0.0
    sectionB.someCapacityB.totalWeight.unit        GRAMS       GRAMS       GRAMS
    topLevelIdentifier                            a86507      a86507      a86507
    taskIdentifier                             c6ba-1012   cc48-07bf      35b896
    taskLocationIdentifier                          3438        4b32       7bcca
    taskTime.value                                     0           0           0
    taskTime.unit                                SECONDS     SECONDS     SECONDS
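
If the same explode/concat pattern has to be repeated for several nested layers, it can be generalized. The helper below is my own sketch (not from the answer above): it loops, exploding list columns and expanding dict columns, until every cell is a scalar. It assumes those columns contain no missing values, since `json_normalize` cannot expand NaN cells.

```python
import pandas as pd

def fully_flatten(df):
    """Repeatedly explode list columns and expand dict columns until
    every cell is a scalar. Assumes no NaN cells in nested columns."""
    while True:
        # Columns still holding lists (to explode) or dicts (to expand)
        list_cols = [c for c in df.columns
                     if df[c].map(lambda v: isinstance(v, list)).any()]
        dict_cols = [c for c in df.columns
                     if df[c].map(lambda v: isinstance(v, dict)).any()]
        if not list_cols and not dict_cols:
            return df.reset_index(drop=True)
        for col in list_cols:
            df = df.explode(col)
        for col in dict_cols:
            # Expand the dict column, prefixing new columns with its name
            expanded = pd.json_normalize(df[col].tolist())
            expanded.index = df.index
            expanded.columns = [f"{col}.{c}" for c in expanded.columns]
            df = pd.concat([df.drop(columns=col), expanded], axis=1)
```

Calling `fully_flatten(pd.json_normalize(simulation_data))` then yields one fully scalar row per task, with dotted column names such as `activities.taskList.taskTime.value`.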