Search code examples
pythonjsonpandasdataframejson-normalize

How do I use json_normalize on a json containing a list of nested lists


I'm dealing with a complex JSON string in Python, and I'm having trouble getting the data into a Pandas Data Frame. Sample data, current output, and expected output are all below. I'm attempting to use json_normalize, but it isn't fully normalizing, leaving me with several columns containing JSON lists. My end goal is to insert this into a SQL DB, however the json strings are being truncated due to their length, so I can not use my initial plan to use OPENJSON() to perform the conversion in SQL.

My json_normalize looks like this below. Using max_level=10 was working for all the other API calls I was making for this resource except for this specific table. I've tried listing it out by column as well, but my attempts at this have been failing.

incomingJson = json.loads(incomingString)
incomingDataFrame = pd.json_normalize(incomingJson, 'fields', max_level=10)

incomingString looks like the json below. This is dummy data, just to clarify.


{
    "fields": [
        {
            "id": 12345,
            "description": "Location",
            "choices": [
                {
                    "id": 1,
                    "name": "Seattle"
                },
                {
                    "id": 2,
                    "name": "Nashville"
                }
            ],
            "other_choices": []
        },
        {
            "id": 67372,
            "description": "Name",
            "choices": [
                {
                    "id": 101,
                    "name": "John"
                },
                {
                    "id": 102,
                    "name": "Mary"
                }
            ],
            "other_choices": []
        },
        {
            "id": 39265,
            "description": "Department",
            "choices": [
                {
                    "id": 143,
                    "name": "IT"
                },
                {
                    "id": 226,
                    "name": "HR"
                },
                {
                    "id": 394,
                    "name": "ERG"
                }
            ],
            "other_choices": [
                {
                    "id": 9473
                }
            ]
        }
    ]
}

When I run my code, I get an output like the one below.

id description choices other_choices
12345 Location [{"id": 1, "name": "Seattle"}, {"id": 2, "name": "Nashville"}] []
67372 Name [{"id": 101, "name": "John"}, {"id": 102, "name": "Mary"}] []
39265 Department {"id": 143, "name": "IT"}, {"id": 226, "name": "HR"}, {"id": 394, "name": "ERG"} [{"id": 9473}]

I am expecting an output more similar to the table below

id description choices.id choices.name other_choices.id
12345 Location 1 Seattle
12345 Location 2 Nashville
67372 Name 101 John
67372 Name 102 Mary
39265 Department 143 IT 9473
39265 Department 226 HR 9473
39265 Department 394 ERG 9473

Any and all help is appreciated.


Solution

  • The meta and record_path arguments proved useful.

    df_c = pd.json_normalize(
        json["fields"],
        record_path=["choices"],
        meta=["id", "description"],
        record_prefix="choices.",
    )
    
    df_oc = pd.json_normalize(
        json["fields"],
        record_path=["other_choices"],
        meta=["id"],
        record_prefix="other_choices.",
    )
    
    col_order = [
        "id",
        "description",
        "choices.id",
        "choices.name",
        "other_choices.id",
    ]
    df = pd.merge(df_c, df_oc, how="left", on="id")[col_order]
    

    record_path points to the entries which are lists and meta points to entries that are single values. I could not find a way to point to multiple record_path, so I split the dfs and joined them.