I get the data in a predefined format I cannot change:
data =
{
"id": 12345,
"name": "Doe",
"gender": {
"textEn": "Masculin"
},
"professions": [
{
"job_description": {
"textEn": "Job description"
},
"cetTitles": [
{
"cetTitleType": {
"textEn": "Recognition"
},
"issuanceDate": "1992-04-14T00:00:00Z",
"phoneNumbers": [
"123 221 00 70"
]
}
]
}
]
}
While I can normalise data one level down ('gender') through the pd.json_normalize function, I struggle to access info further down in the hierarchy.
I tried e.g. getting the job description through
data = pd.json_normalize(data,record_path=['professions','job_description'],meta='id')
but I get a TypeError. How can I address this and extract all data into a table so the data above is represented as one row with all content as fields?
The output I would like to ultimately achieve looks as follows:
id | name | gender | job_description | cetTitleType | issuanceDate | phoneNumbers |
---|---|---|---|---|---|---|
12345 | Doe | Masculin | Job description | Recognition | 1992-04-14T00:00:00Z | 123 221 00 70 |
Using json_normalize():
Code:
df = pd.json_normalize(
data=data,
record_path=["professions", "cetTitles"],
meta=["id", "name", "gender", ["professions", "job_description"]]
).explode(column="phoneNumbers")
df = df.join(pd.DataFrame(df.pop("gender").values.tolist()))
df.columns = df.columns.str.split(".").str[-1]
df = df.rename(columns={"textEn": "gender"})
df = df.join(pd.DataFrame(df.pop("job_description").values.tolist()))
df = df.rename(columns={"textEn": "job_description"})
print(df)
Output:
issuanceDate phoneNumbers gender id name gender job_description
0 1992-04-14T00:00:00Z 123 221 00 70 Recognition 12345 Doe Masculin Job description