Search code examples
pythonjsonpandasjson-normalize

How do I unpack multiple levels using json_normalize in python pandas?


import pandas as pd
from pandas import json_normalize

Suppose I have the following data:

data = [
    {
        'Name':'Rocco',
        'Year':2020,
        'Location':'Itay',
        'Source':[
            {'Movies': 
             {'MovieNumber':1, 'Money':1000, 'Percent':10}
            }
        ]
    },
    {
        'Name':'Anja',
        'Year':2021,
        'Location':'Germany',
        'Source':[
            {'Movies': [
                {'MovieNumber':1, 'Money':2000, 'Percent':10},
                {'MovieNumber':2, 'Money':3000, 'Percent':10}
            ]
            }
        ]
    },
    {
        'Name':'Kasia',
        'Year':2021,
        'Location':'Poland',
        'Source':[
            {'Movies': [
                {'MovieNumber':1, 'Money':1000, 'Percent':10},
                {'MovieNumber':2, 'Money':1000, 'Percent':10},
                {'MovieNumber':3, 'Money':1000, 'Percent':10}
            ]
            }
        ]
    }
]

This is what it looks like as a pandas DF:

df = pd.DataFrame(data)
df

df1

This was my first attempt at unpacking:

results = pd.json_normalize(
    data,
    record_path = ['Source'],
    meta = ['Year']
)

results

This only did the first level so Rocco's information came out correct but Anja's and Kasia's did not because the data structure inside Source changed to a list.

My next attempt involved the following but it too did not work:

results2 = pd.json_normalize(
    data,
    record_path = ['Source'],
    meta = [
        'Year',
        ['Source'],
           ]
)

results2

From looking at the documentation and YouTube videos, I think I need another element in both the record_path field and meta field but I don't know what would go there. I've tried different combinations but none of them worked.

Does anyone know how I can unpack this? I would like the end result to be a DF with the following columns: Year, Name, Location, Movies.MovieNumber, Movies.Money, Movies.Percent. So, there should be a total of 6 rows.


Solution

  • Fix your dictionary first, it's not consistent, this makes it consistent:

    for i, x in enumerate(data):
        x = x['Source'][0]['Movies']
        if not isinstance(x, list):
            data[i]['Source'][0]['Movies'] = [x]
    

    Then json_normalize works just fine:

    df = pd.json_normalize(data, ['Source','Movies'], ['Name', 'Year', 'Location'])
    print(df)
    

    Output:

       MovieNumber  Money  Percent   Name  Year Location
    0            1   1000       10  Rocco  2020     Itay
    1            1   2000       10   Anja  2021  Germany
    2            2   3000       10   Anja  2021  Germany
    3            1   1000       10  Kasia  2021   Poland
    4            2   1000       10  Kasia  2021   Poland
    5            3   1000       10  Kasia  2021   Poland
    

    What my code actually did, Before:

    [
      {
        "Name": "Rocco",
        "Year": 2020,
        "Location": "Itay",
        "Source": [
          {
            "Movies": # Here, Movies isn't a list.
              {"MovieNumber": 1, "Money": 1000, "Percent": 10}
          }
        ]
      },
      {
        "Name": "Anja",
        "Year": 2021,
        "Location": "Germany",
        "Source": [
          {
            "Movies": [ # Here, Movies is a list.
              {"MovieNumber": 1, "Money": 2000, "Percent": 10},
              {"MovieNumber": 2, "Money": 3000, "Percent": 10}
            ]
          }
        ]
      }
    ]
    

    After:

    [
      {
        "Name": "Rocco",
        "Year": 2020,
        "Location": "Itay",
        "Source": [
          {
            "Movies": [ # Now this is a list.
              {"MovieNumber": 1, "Money": 1000, "Percent": 10}
            ]
          }
        ]
      },
      {
        "Name": "Anja",
        "Year": 2021,
        "Location": "Germany",
        "Source": [
          {
            "Movies": [ # And this remains unchanged.
              {"MovieNumber": 1, "Money": 2000, "Percent": 10},
              {"MovieNumber": 2, "Money": 3000, "Percent": 10 }
            ]
          }
        ]
      }
    ]
    

    So all I did was force all Source.Movies to be lists, by putting the contents in a list if it wasn't already a list.