Search code examples
pythonjsonpandasdataframepandas-merge

How merge or join data in a Pandas nested DataFrame


I'm trying to figure out how to perform a Merge or Join on a nested field in a DataFrame. Below is some example data:

df_all_groups = pd.read_json("""
[
    {
        "object": "group",
        "id": "group-one",
        "collections": [
            {
                "id": "111-111-111",
                "readOnly": false
            },
            {
                "id": "222-222-222",
                "readOnly": false
            }
        ]
    },
    {
        "object": "group",
        "id": "group-two",
        "collections": [
            {
                "id": "111-111-111",
                "readOnly": false
            },
            {
                "id": "333-333-333",
                "readOnly": false
            }
        ]
    }
]
""")

df_collections_with_names = pd.read_json("""
[
    {
        "object": "collection",
        "id": "111-111-111",
        "externalId": null,
        "name": "Cats"
      },
      {
        "object": "collection",
        "id": "222-222-222",
        "externalId": null,
        "name": "Dogs"
      },
      {
        "object": "collection",
        "id": "333-333-333",
        "externalId": null,
        "name": "Fish"
      }
]
""")

I'm trying to add the name field from df_collections_with_names to each df_all_groups['collections'][<index>] by joining on df_all_groups['collections'][<index>].id The output I'm trying to get to is:

[
    {
        "object": "group",
        "id": "group-one",
        "collections": [
            {
                "id": "111-111-111",
                "readOnly": false,
                "name": "Cats" // See Collection name was added
            },
            {
                "id": "222-222-222",
                "readOnly": false,
                "name": "Dogs" // See Collection name was added
            }
        ]
    },
    {
        "object": "group",
        "id": "group-two",
        "collections": [
            {
                "id": "111-111-111",
                "readOnly": false,
                "name": "Cats" // See Collection name was added
            },
            {
                "id": "333-333-333",
                "readOnly": false,
                "name": "Fish" // See Collection name was added
            }
        ]
    }
]

I've tried to use the merge method, but can't seem to get it to run on the collections nested field as I believe it's a series at that point.


Solution

  • Here's one approach:

    First convert the json string used to construct df_all_groups (I named it all_groups here) to a dictionary using json.loads. Then use json_normalize to contruct a DataFrame with it.

    Then merge the DataFrame constructed above with df_collections_with_names; we have "names" column now.

    The rest is constructing the desired dictionary from the result obtained above; groupby + apply(to_dict) + reset_index + to_dict will fetch the desired outcome:

    import json
    out = (pd.json_normalize(json.loads(all_groups), ['collections'], ['object', 'id'], meta_prefix='_')
           .merge(df_collections_with_names, on='id', suffixes=('','_'))
           .drop(columns=['object','externalId']))
    out = (out.groupby(['_object','_id']).apply(lambda x: x[['id','readOnly','name']].to_dict('records'))
           .reset_index(name='collections'))
    out.rename(columns={c: c.strip('_') for c in out.columns}).to_dict('records')
    

    Output:

    [{'object': 'group',
      'id': 'group-one',
      'collections': [{'id': '111-111-111', 'readOnly': False, 'name': 'Cats'},
       {'id': '222-222-222', 'readOnly': False, 'name': 'Dogs'}]},
     {'object': 'group',
      'id': 'group-two',
      'collections': [{'id': '111-111-111', 'readOnly': False, 'name': 'Cats'},
       {'id': '333-333-333', 'readOnly': False, 'name': 'Fish'}]}]