Here is a subset of my json file:
d = {'data': {'questions': [{'id': 6574,
'text': 'Question #1',
'instructionalText': '',
'minimumResponses': 0,
'maximumResponses': None,
'sortOrder': 1,
'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
{'id': 362950, 'text': 'Answer #2', 'parentId': None},
{'id': 362951, 'text': 'Answer #3', 'parentId': None},
{'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}
I want to place this into a dataframe with each question and a row for each answer.
Python code:
from pandas import json_normalize
import json
fields = ['text','answers.text']
with open(R'response.json') as f:
    d = json.load(f)
data = json_normalize(d['data'],['questions'],errors='ignore')
data = data[fields]
print(data)
Which produces a KeyError:
KeyError: "['answers.text'] not in index"
Been at this for a few hours and absolutely cannot figure this out. I feel like it should be quite simple, but it never is.
Use record_prefix, along with record_path and meta, so d can be normalized all at once.

pd.json_normalize will result in a ValueError when there are overlapping key names between the record_path and meta, and 'id' and 'text' are in both: ValueError: Conflicting metadata name id, need distinguishing prefix occurs without a record_prefix.

The KeyError occurs because 'answers.text' is not in d; it is a column created by .json_normalize(). For keys that aren't required in df, remove them from meta.
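For reference, here is roughly what the question's call actually produces, and why both errors show up (a minimal sketch against the sample d; the exact column order and messages may vary by pandas version):

import pandas as pd

# the question's call leaves 'answers' as a single column of lists,
# so there is no 'answers.text' column to select
print(pd.json_normalize(d['data'], ['questions']).columns.tolist())
# roughly: ['id', 'text', 'instructionalText', 'minimumResponses', 'maximumResponses', 'sortOrder', 'answers']

# and switching to record_path=['answers'] with meta=['id', 'text'], but no
# record_prefix, raises:
# ValueError: Conflicting metadata name id, need distinguishing prefix

Using record_path=['answers'] together with meta and a record_prefix normalizes everything in one call: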
import pandas as pd

# normalize d
df = pd.json_normalize(data=d['data']['questions'],
                       record_path=['answers'],
                       meta=['id', 'text', 'instructionalText', 'minimumResponses', 'maximumResponses', 'sortOrder'],
                       record_prefix='answers_')

# display(df)
  answers_id answers_text answers_parentId    id         text     instructionalText  minimumResponses  maximumResponses  sortOrder
0     362949    Answer #1             None  6574  Question #1                                       0              None          1
1     362950    Answer #2             None  6574  Question #1                                       0              None          1
2     362951    Answer #3             None  6574  Question #1                                       0              None          1
3     362952    Answer #4             None  6574  Question #1                                       0              None          1
4     262949    Answer #1             None  4756  Question #2  No cheating, cheater                 0              None          1
5     262950    Answer #2             None  4756  Question #2  No cheating, cheater                 0              None          1
6     262951    Answer #3             None  4756  Question #2  No cheating, cheater                 0              None          1
7     262952    Answer #4             None  4756  Question #2  No cheating, cheater                 0              None          1
Test data used for the output above, with a second question added:
d = {'data': {'questions': [{'id': 6574,
'text': 'Question #1',
'instructionalText': '',
'minimumResponses': 0,
'maximumResponses': None,
'sortOrder': 1,
'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
{'id': 362950, 'text': 'Answer #2', 'parentId': None},
{'id': 362951, 'text': 'Answer #3', 'parentId': None},
{'id': 362952, 'text': 'Answer #4', 'parentId': None}]},
{'id': 4756,
'text': 'Question #2',
'instructionalText': 'No cheating, cheater',
'minimumResponses': 0,
'maximumResponses': None,
'sortOrder': 1,
'answers': [{'id': 262949, 'text': 'Answer #1', 'parentId': None},
{'id': 262950, 'text': 'Answer #2', 'parentId': None},
{'id': 262951, 'text': 'Answer #3', 'parentId': None},
{'id': 262952, 'text': 'Answer #4', 'parentId': None}]}]}}
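With everything in one dataframe, the question/answer text pairs the question was after are just a column selection, using the prefixed column names shown in the output above:

# one row per answer, with its question text
data = df[['text', 'answers_text']]
print(data)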
Using .apply(pd.Series) to expand the nested 'answers' column is not recommended, because it is incredibly slow.
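For comparison, a rough sketch of that slower pattern (shown only to illustrate what is being discouraged; the qs, answers, and slow_df names are made up here):

# explode the nested 'answers' lists, then expand each dict with .apply(pd.Series) -- slow
qs = pd.DataFrame(d['data']['questions']).explode('answers').reset_index(drop=True)
answers = qs['answers'].apply(pd.Series).add_prefix('answers_')
slow_df = pd.concat([qs.drop(columns='answers'), answers], axis=1)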