Search code examples
pythonjsonjson-normalize

json_normalise issue when "record_path" variable uses nested data & "meta" variable has nested data, works fine without nested data for "record_path"


I am running into a problem with json_normalise, I can't wrap my head around it. The issue arises when the "record_path" part of the function has nested data (['extra', 'students']), I can't seem to get used nested data in the "meta" part of the function. Works fine when the record path is not nested. Any ideas?

    json_list = [
    {
        'class': 'Year 1',
        'student count': 20,
        'room': 'Yellow',
        'info': {
            'teachers': {
                'math': 'Rick Scott',
                'physics': 'Elon Mask'
            }
        },
        'extra': {
            'students': [
                {
                    'name': 'Tom',
                    'sex': 'M',
                    'grades': { 'math': 66, 'physics': 77 }
                },
                {
                    'name': 'James',
                    'sex': 'M',
                    'grades': { 'math': 80, 'physics': 78 }
                },
            ]
        }
    },
    {
        'class': 'Year 2',
        'student count': 25,
        'room': 'Blue',
        'info': {
            'teachers': {
                'math': 'Alan Turing',
                'physics': 'Albert Einstein'
            }
        },
        'extra': {
            'students': [
                { 'name': 'Tony', 'sex': 'M' },
                { 'name': 'Jacqueline', 'sex': 'F' },
            ]
        }
    },
]

print(pd.json_normalize(
    json_list,
    record_path = ['extra', 'students'],
    meta=['class', 'room', ['info', 'teachers', 'math']]
) )

Solution

  • You can transform the json_list before creating the dataframe or try to modify dataframe step-by-step by .apply(pd.Series)/.explode:

    df = pd.DataFrame(json_list)
    df = pd.concat(
        [df, df.pop("info").apply(pd.Series), df.pop("extra").apply(pd.Series)],
        axis=1,
    ).explode("students")
    df = pd.concat(
        [
            df,
            df.pop("teachers").apply(pd.Series).add_prefix("teachers_"),
            df.pop("students").apply(pd.Series),
        ],
        axis=1,
    )
    df = pd.concat(
        [df, df.pop("grades").apply(pd.Series).add_prefix("grades_")],
        axis=1,
    ).drop(columns="grades_0")
    
    print(df)
    

    Prints:

        class  student count    room teachers_math teachers_physics        name sex  grades_math  grades_physics
    0  Year 1             20  Yellow    Rick Scott        Elon Mask         Tom   M         66.0            77.0
    0  Year 1             20  Yellow    Rick Scott        Elon Mask       James   M         80.0            78.0
    1  Year 2             25    Blue   Alan Turing  Albert Einstein        Tony   M          NaN             NaN
    1  Year 2             25    Blue   Alan Turing  Albert Einstein  Jacqueline   F          NaN             NaN