import pandas as pd
from pandas import json_normalize
Suppose I have the following data:
data = [
{
'Name':'Rocco',
'Year':2020,
'Location':'Itay',
'Source':[
{'Movies':
{'MovieNumber':1, 'Money':1000, 'Percent':10}
}
]
},
{
'Name':'Anja',
'Year':2021,
'Location':'Germany',
'Source':[
{'Movies': [
{'MovieNumber':1, 'Money':2000, 'Percent':10},
{'MovieNumber':2, 'Money':3000, 'Percent':10}
]
}
]
},
{
'Name':'Kasia',
'Year':2021,
'Location':'Poland',
'Source':[
{'Movies': [
{'MovieNumber':1, 'Money':1000, 'Percent':10},
{'MovieNumber':2, 'Money':1000, 'Percent':10},
{'MovieNumber':3, 'Money':1000, 'Percent':10}
]
}
]
}
]
This is what it looks like as a pandas DF:
df = pd.DataFrame(data)
df
This was my first attempt at unpacking:
results = pd.json_normalize(
data,
record_path = ['Source'],
meta = ['Year']
)
This only did the first level so Rocco's information came out correct but Anja's and Kasia's did not because the data structure inside Source
changed to a list.
My next attempt involved the following but it too did not work:
results2 = pd.json_normalize(
data,
record_path = ['Source'],
meta = [
'Year',
['Source'],
]
)
From looking at the documentation and YouTube videos, I think I need another element in both the record_path
field and meta
field but I don't know what would go there. I've tried different combinations but none of them worked.
Does anyone know how I can unpack this? I would like the end result to be a DF with the following columns: Year, Name, Location, Movies.MovieNumber, Movies.Money, Movies.Percent. So, there should be a total of 6 rows.
Fix your dictionary first, it's not consistent, this makes it consistent:
for i, x in enumerate(data):
x = x['Source'][0]['Movies']
if not isinstance(x, list):
data[i]['Source'][0]['Movies'] = [x]
Then json_normalize
works just fine:
df = pd.json_normalize(data, ['Source','Movies'], ['Name', 'Year', 'Location'])
print(df)
Output:
MovieNumber Money Percent Name Year Location
0 1 1000 10 Rocco 2020 Itay
1 1 2000 10 Anja 2021 Germany
2 2 3000 10 Anja 2021 Germany
3 1 1000 10 Kasia 2021 Poland
4 2 1000 10 Kasia 2021 Poland
5 3 1000 10 Kasia 2021 Poland
What my code actually did, Before:
[
{
"Name": "Rocco",
"Year": 2020,
"Location": "Itay",
"Source": [
{
"Movies": # Here, Movies isn't a list.
{"MovieNumber": 1, "Money": 1000, "Percent": 10}
}
]
},
{
"Name": "Anja",
"Year": 2021,
"Location": "Germany",
"Source": [
{
"Movies": [ # Here, Movies is a list.
{"MovieNumber": 1, "Money": 2000, "Percent": 10},
{"MovieNumber": 2, "Money": 3000, "Percent": 10}
]
}
]
}
]
After:
[
{
"Name": "Rocco",
"Year": 2020,
"Location": "Itay",
"Source": [
{
"Movies": [ # Now this is a list.
{"MovieNumber": 1, "Money": 1000, "Percent": 10}
]
}
]
},
{
"Name": "Anja",
"Year": 2021,
"Location": "Germany",
"Source": [
{
"Movies": [ # And this remains unchanged.
{"MovieNumber": 1, "Money": 2000, "Percent": 10},
{"MovieNumber": 2, "Money": 3000, "Percent": 10 }
]
}
]
}
]
So all I did was force all Source.Movies
to be lists, by putting the contents in a list if it wasn't already a list.