Search code examples
pythonjsondataframe

How to convert a nested JSON to a Dataframe?


A response.json() returns a JSON in the following format:

{
    "workbooks": [
        {
            "name": "WORKBOOK_A",
            "embeddedDatasources": [
                {
                    "upstreamTables": [
                        {"name": "WORKBOOK_A_TABLE_A"}]},
                {
                    "upstreamTables": [
                        {"name": "WORKBOOK_A_TABLE_B"},
                        {"name": "WORKBOOK_A_TABLE_C"}]},
                {
                    "upstreamTables": []}]},
        {
            "name": "WORKBOOK_B",
            "embeddedDatasources": [
                {
                    "upstreamTables": [
                        {"name": "WORKBOOK_B_TABLE_A"},
                        {"name": "WORKBOOK_B_TABLE_B"}]},
                {
                    "upstreamTables": [
                        {"name": "WORKBOOK_B_TABLE_C"},
                        {"name": "WORKBOOK_B_TABLE_D"}]}]}]}

I am trying to convert it to a dataframe like this:

workbooks upstreamTables
WORKBOOK_A WORKBOOK_A_TABLE_A
WORKBOOK_A WORKBOOK_A_TABLE_B
WORKBOOK_A WORKBOOK_A_TABLE_C
WORKBOOK_B WORKBOOK_B_TABLE_A
WORKBOOK_B WORKBOOK_B_TABLE_B
WORKBOOK_B WORKBOOK_B_TABLE_C
WORKBOOK_B WORKBOOK_B_TABLE_D

"upstreamTables": [] should be ignored in this case.

Playing around with json_normalize

df = pd.json_normalize(json_data)

didn't play out so far and extracting the data as separate dataframes and rejoining them seems too convulsive.


Solution

  • Here's one approach:

    • Pass resp (i.e., response.json()) to pd.json_normalize with both record_path and meta. Add meta_prefix to avoid a ValueError: Conflicting metadata. Cf. this post. Otherwise we would end up with 2 name columns.
    • Use df.rename to rename the columns and re-order them.
    import pandas as pd
    
    # resp = {...}
    
    df = (pd.json_normalize(resp['workbooks'], 
                            record_path=['embeddedDatasources', 'upstreamTables'], 
                            meta='name', 
                            meta_prefix='meta_'
                            )
          .rename(columns={'name': 'upstreamTables',
                           'meta_name': 'workbooks'})
          [['workbooks', 'upstreamTables']]
          )
    

    Output:

        workbooks      upstreamTables
    0  WORKBOOK_A  WORKBOOK_A_TABLE_A
    1  WORKBOOK_A  WORKBOOK_A_TABLE_B
    2  WORKBOOK_A  WORKBOOK_A_TABLE_C
    3  WORKBOOK_B  WORKBOOK_B_TABLE_A
    4  WORKBOOK_B  WORKBOOK_B_TABLE_B
    5  WORKBOOK_B  WORKBOOK_B_TABLE_C
    6  WORKBOOK_B  WORKBOOK_B_TABLE_D