I'm trying to create a pandas dataframe form json file. I've seen a multiple solutions to this problem which uses built in functions from_dict/json_normalize yet I'm unable to apply it to my code. Here's how my data is structured in json file:
"data": [
{
"groups": {
"data": [
{
"group": "Math",
"year_joined": "2009"
},
{
"group_name": "History",
"year_joined": "2011"
},
{
"group_name": "Biology",
"year_joined": "2010"
}
]
},
"id": "12512"
},
When I'm trying to normalize this data with pandas function like this:
path = 'mypath'
f = open(path)
data = json.load(f)
test = pd.json_normalize(
data['data'],
errors='ignore')
I just receive something like this:
id groups.data
0 12512 [{'group_name': 'Math', 'year_joined': '2009', 'gr...
1 23172 [{'group_name': 'Chemistry', 'year_joined': '2005'...
I want this data to look like this (solution 1):
id group year_joined
0 12512 group1 year1
1 12512 group2 year2
2 12512 group3 year3
Or like this (solution 2):
id group year_joined
0 12512 group1,group2,group3 year1,year2,year3
1 23172 group4,group5 year4,year5
How can i achieve it? I tried passing 'record_path' parameter to 'json_normalize' function but it doesn't change anything. I tried to use 'DataFrame.from_dict' function to work around this but I failed. The only way I was able to get to solution 1 was to just create multiple loops that iterated through everything in json file and add it to separate list. It kinda works but takes a lot of time on bigger datasets.
How could i use built-in pandas tools to process files which are nested as dictionaries in 3rd layer of the file as presented above?
You need to collect the information from the data
dictionary
solution 1
d = {}
for group in data["data"]:
groups = [x["group_name"] for x in group['groups']["data"]]
d['id'] = d.get('id', []) + [group['id']] * len(groups)
d['group'] = d.get('group', []) + groups
d['year_joined'] = d.get('year_joined', []) + [x["year_joined"] for x in group['groups']["data"]]
df = pd.DataFrame(d)
Output
id group year_joined
0 12512 Math 2009
1 12512 History 2011
2 12512 Biology 2010
3 23172 Chemistry 2007
4 23172 Economics 2008
solution 2
d = {}
for group in data["data"]:
d['id'] = d.get('id', []) + [group['id']]
d['group'] = d.get('group', []) + [','.join(x["group_name"] for x in group['groups']["data"])]
d['year_joined'] = d.get('year_joined', []) + [','.join(x["year_joined"] for x in group['groups']["data"])]
df = pd.DataFrame(d)
Output
id group year_joined
0 12512 Math,History,Biology 2009,2011,2010
1 23172 Chemistry,Economics 2007,2008