Search code examples
pythonjsonpandasdictionaryjson-normalize

json_normalize produces a KeyError when trying to extract certain attributes


Here is a subset of my json file:

d = {'data': {'questions': [{'id': 6574,
                             'text': 'Question #1',
                             'instructionalText': '',
                             'minimumResponses': 0,
                             'maximumResponses': None,
                             'sortOrder': 1,
                             'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 362950, 'text': 'Answer #2', 'parentId': None},
                                         {'id': 362951, 'text': 'Answer #3', 'parentId': None},
                                         {'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}

I want to place this into a dataframe with each question and a row for each answer.

Python code:

from pandas import json_normalize
import json

fields = ['text','answers.text']

with open(R'response.json') as f:
    d = json.load(f)

data = json_normalize(d['data'],['questions'],errors='ignore')
data = data[fields]

print(data)

Which produces a KeyError:

KeyError: "['answers.text'] not in index"

Been at this for a few hours and absolutely cannot figure this out. I feel like it should be quite simple, but it never is.


Solution

    • Use record_prefix, with record_path and meta, so d can be normalized all at once
      • pd.json_normalize will result in a ValueError when there are overlapping key names between the record_path and meta, and 'id' and 'text' are in both.
      • ValueError: Conflicting metadata name id, need distinguishing prefix occurs without using record_path.
    • The KeyError occurs because 'answers.text' is not in d, it's created by .json_normalize()
    • If there are any top level keys that aren't required in df, remove them from meta.
    import pandas as pd
    
    # normalize d
    df = pd.json_normalize(data=d['data']['questions'],
                           record_path= ['answers'],
                           meta=['id', 'text', 'instructionalText', 'minimumResponses', 'maximumResponses', 'sortOrder'],
                           record_prefix='answers_')
    
    # display(df)
       answers_id answers_text answers_parentId    id         text     instructionalText minimumResponses maximumResponses sortOrder
    0      362949    Answer #1             None  6574  Question #1                                      0             None         1
    1      362950    Answer #2             None  6574  Question #1                                      0             None         1
    2      362951    Answer #3             None  6574  Question #1                                      0             None         1
    3      362952    Answer #4             None  6574  Question #1                                      0             None         1
    4      262949    Answer #1             None  4756  Question #2  No cheating, cheater                0             None         1
    5      262950    Answer #2             None  4756  Question #2  No cheating, cheater                0             None         1
    6      262951    Answer #3             None  4756  Question #2  No cheating, cheater                0             None         1
    7      262952    Answer #4             None  4756  Question #2  No cheating, cheater                0             None         1
    

    Expanded Test Data

    d = {'data': {'questions': [{'id': 6574,
                                 'text': 'Question #1',
                                 'instructionalText': '',
                                 'minimumResponses': 0,
                                 'maximumResponses': None,
                                 'sortOrder': 1,
                                 'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                             {'id': 362950, 'text': 'Answer #2', 'parentId': None},
                                             {'id': 362951, 'text': 'Answer #3', 'parentId': None},
                                             {'id': 362952, 'text': 'Answer #4', 'parentId': None}]},
                                {'id': 4756,
                                 'text': 'Question #2',
                                 'instructionalText': 'No cheating, cheater',
                                 'minimumResponses': 0,
                                 'maximumResponses': None,
                                 'sortOrder': 1,
                                 'answers': [{'id': 262949, 'text': 'Answer #1', 'parentId': None},
                                             {'id': 262950, 'text': 'Answer #2', 'parentId': None},
                                             {'id': 262951, 'text': 'Answer #3', 'parentId': None},
                                             {'id': 262952, 'text': 'Answer #4', 'parentId': None}]}]}}