I'm dealing with a complex JSON string in Python, and I'm having trouble getting the data into a pandas DataFrame. Sample data, current output, and expected output are all below. I'm attempting to use `json_normalize`, but it isn't fully normalizing, leaving me with several columns that still contain JSON lists. My end goal is to insert this into a SQL database; however, the JSON strings are being truncated due to their length, so I cannot fall back on my initial plan of using OPENJSON() to do the conversion in SQL.
My `json_normalize` call looks like the code below. Using `max_level=10` worked for all the other API calls I was making against this resource, except for this specific table. I've also tried listing the columns out individually, but those attempts have failed as well.
import json
import pandas as pd

incomingJson = json.loads(incomingString)
incomingDataFrame = pd.json_normalize(incomingJson, 'fields', max_level=10)
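From what I can tell, `max_level` only controls how deeply nested dictionaries get flattened; `json_normalize` never expands list-valued fields like `choices`, so those columns come through as raw Python lists no matter how high the level is set. A minimal example of what I mean (made-up data):

import pandas as pd

sample = {"fields": [{"id": 1, "choices": [{"id": 10, "name": "A"}]}]}

# max_level flattens nested dicts, but list values are left untouched,
# so "choices" survives as a single column holding a list of dicts.
demo = pd.json_normalize(sample, "fields", max_level=10)
print(demo["choices"].iloc[0])  # [{'id': 10, 'name': 'A'}]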
incomingString looks like the JSON below (this is dummy data, to be clear).
{
"fields": [
{
"id": 12345,
"description": "Location",
"choices": [
{
"id": 1,
"name": "Seattle"
},
{
"id": 2,
"name": "Nashville"
}
],
"other_choices": []
},
{
"id": 67372,
"description": "Name",
"choices": [
{
"id": 101,
"name": "John"
},
{
"id": 102,
"name": "Mary"
}
],
"other_choices": []
},
{
"id": 39265,
"description": "Department",
"choices": [
{
"id": 143,
"name": "IT"
},
{
"id": 226,
"name": "HR"
},
{
"id": 394,
"name": "ERG"
}
],
"other_choices": [
{
"id": 9473
}
]
}
]
}
When I run my code, I get an output like the one below.
id | description | choices | other_choices |
---|---|---|---|
12345 | Location | [{"id": 1, "name": "Seattle"}, {"id": 2, "name": "Nashville"}] | [] |
67372 | Name | [{"id": 101, "name": "John"}, {"id": 102, "name": "Mary"}] | [] |
39265 | Department | [{"id": 143, "name": "IT"}, {"id": 226, "name": "HR"}, {"id": 394, "name": "ERG"}] | [{"id": 9473}] |
I am expecting output more like the table below.
id | description | choices.id | choices.name | other_choices.id |
---|---|---|---|---|
12345 | Location | 1 | Seattle | |
12345 | Location | 2 | Nashville | |
67372 | Name | 101 | John | |
67372 | Name | 102 | Mary | |
39265 | Department | 143 | IT | 9473 |
39265 | Department | 226 | HR | 9473 |
39265 | Department | 394 | ERG | 9473 |
Any and all help is appreciated.
The `meta` and `record_path` arguments proved useful.
# Flatten the "choices" lists; meta repeats each field's id and description per row.
df_c = pd.json_normalize(
    incomingJson["fields"],
    record_path=["choices"],
    meta=["id", "description"],
    record_prefix="choices.",
)
# Flatten the "other_choices" lists; only the parent id is needed for the join.
df_oc = pd.json_normalize(
    incomingJson["fields"],
    record_path=["other_choices"],
    meta=["id"],
    record_prefix="other_choices.",
)
col_order = [
"id",
"description",
"choices.id",
"choices.name",
"other_choices.id",
]
df = pd.merge(df_c, df_oc, how="left", on="id")[col_order]
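One tweak worth noting: for the sample data, the left merge leaves NaN in other_choices.id for the Location and Name rows (they have no other_choices), which silently upcasts that column to float. If integer ids matter for the SQL insert, pandas' nullable Int64 dtype handles the missing values:

# Keep other_choices.id as a nullable integer column instead of float + NaN.
df["other_choices.id"] = df["other_choices.id"].astype("Int64")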
`record_path` points to the entries that are lists, and `meta` points to entries that are single values. I could not find a way to pass multiple `record_path`s in a single call, so I split the DataFrames and merged them back together.
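Since the end goal is a SQL database, here is a rough sketch of the insert step using SQLAlchemy and DataFrame.to_sql; the connection string, driver, and table name are placeholders you would swap for your own:

from sqlalchemy import create_engine

# Placeholder connection string; adjust server, database, credentials, and driver.
engine = create_engine(
    "mssql+pyodbc://user:password@server/database?driver=ODBC+Driver+17+for+SQL+Server"
)

# Append the flattened rows to a table (pandas creates it if it does not exist).
df.to_sql("fields_flat", con=engine, index=False, if_exists="append")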