Here is a subset of my json file:
d = {'data': {'questions': [{'id': 6574,
'text': 'Question #1',
'instructionalText': '',
'minimumResponses': 0,
'maximumResponses': None,
'sortOrder': 1,
'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
{'id': 362950, 'text': 'Answer #2', 'parentId': None},
{'id': 362951, 'text': 'Answer #3', 'parentId': None},
{'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}
I want to place this into a dataframe with each question and a row for each answer.
Python code:
from pandas import json_normalize
import json
fields = ['text','answers.text']
with open(R'response.json') as f:
    d = json.load(f)
data = json_normalize(d['data'],['questions'],errors='ignore')
data = data[fields]
print(data)
Which produces a KeyError:
KeyError: "['answers.text'] not in index"
Been at this for a few hours and absolutely cannot figure this out. I feel like it should be quite simple, but it never is.
Use record_prefix, along with record_path and meta, so d can be normalized all at once.

pd.json_normalize will result in a ValueError when there are overlapping key names between the record_path and meta, and 'id' and 'text' are in both: ValueError: Conflicting metadata name id, need distinguishing prefix occurs without a record_prefix.

The KeyError occurs because 'answers.text' is not in d; it is a column created by .json_normalize(). For keys that aren't required in df, remove them from meta.
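For reference, here is roughly what the question's call actually produces, and why both errors show up (a minimal sketch against the sample d; the exact column order and messages may vary by pandas version):

import pandas as pd

# the question's call leaves 'answers' as a single column of lists,
# so there is no 'answers.text' column to select
print(pd.json_normalize(d['data'], ['questions']).columns.tolist())
# roughly: ['id', 'text', 'instructionalText', 'minimumResponses', 'maximumResponses', 'sortOrder', 'answers']

# and switching to record_path=['answers'] with meta=['id', 'text'], but no
# record_prefix, raises:
# ValueError: Conflicting metadata name id, need distinguishing prefix

Using record_path=['answers'] together with meta and a record_prefix normalizes everything in one call: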
import pandas as pd

# normalize d
df = pd.json_normalize(data=d['data']['questions'],
                       record_path=['answers'],
                       meta=['id', 'text', 'instructionalText', 'minimumResponses', 'maximumResponses', 'sortOrder'],
                       record_prefix='answers_')

# display(df)
  answers_id answers_text answers_parentId    id         text     instructionalText  minimumResponses  maximumResponses  sortOrder
0     362949    Answer #1             None  6574  Question #1                                       0              None          1
1     362950    Answer #2             None  6574  Question #1                                       0              None          1
2     362951    Answer #3             None  6574  Question #1                                       0              None          1
3     362952    Answer #4             None  6574  Question #1                                       0              None          1
4     262949    Answer #1             None  4756  Question #2  No cheating, cheater                 0              None          1
5     262950    Answer #2             None  4756  Question #2  No cheating, cheater                 0              None          1
6     262951    Answer #3             None  4756  Question #2  No cheating, cheater                 0              None          1
7     262952    Answer #4             None  4756  Question #2  No cheating, cheater                 0              None          1
Test data used for the output above, with a second question added:
d = {'data': {'questions': [{'id': 6574,
'text': 'Question #1',
'instructionalText': '',
'minimumResponses': 0,
'maximumResponses': None,
'sortOrder': 1,
'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
{'id': 362950, 'text': 'Answer #2', 'parentId': None},
{'id': 362951, 'text': 'Answer #3', 'parentId': None},
{'id': 362952, 'text': 'Answer #4', 'parentId': None}]},
{'id': 4756,
'text': 'Question #2',
'instructionalText': 'No cheating, cheater',
'minimumResponses': 0,
'maximumResponses': None,
'sortOrder': 1,
'answers': [{'id': 262949, 'text': 'Answer #1', 'parentId': None},
{'id': 262950, 'text': 'Answer #2', 'parentId': None},
{'id': 262951, 'text': 'Answer #3', 'parentId': None},
{'id': 262952, 'text': 'Answer #4', 'parentId': None}]}]}}
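With everything in one dataframe, the question/answer text pairs the question was after are just a column selection, using the prefixed column names shown in the output above:

# one row per answer, with its question text
data = df[['text', 'answers_text']]
print(data)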
Using .apply(pd.Series) to expand the nested 'answers' column is not recommended, because it is incredibly slow.
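For comparison, a rough sketch of that slower pattern (shown only to illustrate what is being discouraged; the qs, answers, and slow_df names are made up here):

# explode the nested 'answers' lists, then expand each dict with .apply(pd.Series) -- slow
qs = pd.DataFrame(d['data']['questions']).explode('answers').reset_index(drop=True)
answers = qs['answers'].apply(pd.Series).add_prefix('answers_')
slow_df = pd.concat([qs.drop(columns='answers'), answers], axis=1)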