I am using pd.json_normalize to flatten the "sections" field in this data into rows. It works fine except for records where "sections" is an empty list: the ID below, for example, gets completely ignored and is missing from the final flattened dataframe. I need at least one row per unique ID in the data (some IDs may have many rows, up to one row per unique ID, section_id, question_id, and answer_id as I unnest more fields in the data):
{'_id': '5f48f708fe22ca4d15fb3b55',
'created_at': '2020-08-28T12:22:32Z',
 'sections': []}
Sample data:
sample = [{'_id': '5f48bee4c54cf6b5e8048274',
'created_at': '2020-08-28T08:23:00Z',
'sections': [{'comment': '',
'type_fail': None,
'answers': [{'comment': 'stuff',
'feedback': [],
'value': 10.0,
'answer_type': 'default',
'question_id': '5e59599c68369c24069630fd',
'answer_id': '5e595a7c3fbb70448b6ff935'},
{'comment': 'stuff',
'feedback': [],
'value': 10.0,
'answer_type': 'default',
'question_id': '5e598939cedcaf5b865ef99a',
'answer_id': '5e598939cedcaf5b865ef998'}],
'score': 20.0,
'passed': True,
'_id': '5e59599c68369c24069630fe',
'custom_fields': []},
{'comment': '',
'type_fail': None,
'answers': [{'comment': '',
'feedback': [],
'value': None,
'answer_type': 'not_applicable',
'question_id': '5e59894f68369c2398eb68a8',
'answer_id': '5eaad4e5b513aed9a3c996a5'},
{'comment': '',
'feedback': [],
'value': None,
'answer_type': 'not_applicable',
'question_id': '5e598967cedcaf5b865efe3e',
'answer_id': '5eaad4ece3f1e0794372f8b2'},
{'comment': "stuff",
'feedback': [],
'value': 0.0,
'answer_type': 'default',
'question_id': '5e598976cedcaf5b865effd1',
'answer_id': '5e598976cedcaf5b865effd3'}],
'score': 0.0,
'passed': True,
'_id': '5e59894f68369c2398eb68a9',
'custom_fields': []}]},
{'_id': '5f48f708fe22ca4d15fb3b55',
'created_at': '2020-08-28T12:22:32Z',
'sections': []}]
Tests:
import pandas as pd

df = pd.json_normalize(sample)
df2 = pd.json_normalize(df.to_dict(orient="records"), meta=["_id", "created_at"], record_path="sections", record_prefix="section_")
At this point I am now missing a row for ID "5f48f708fe22ca4d15fb3b55" which I still need.
df3 = pd.json_normalize(df2.to_dict(orient="records"), meta=["_id", "created_at", "section__id", "section_score", "section_passed", "section_type_fail", "section_comment"], record_path="section_answers", record_prefix="")
Can I alter this somehow to make sure that I get at least one row per ID? I'm dealing with millions of records and don't want to discover later that some IDs were missing from my final data. The only solution I can think of is to normalize each dataframe and then left join it back to the original dataframe (roughly sketched below).
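For reference, this is roughly the left-join fallback I have in mind; just a sketch, assuming _id plus created_at uniquely identify a source record:

# fallback idea (untested): normalize, then left-join back to the full ID
# list so that IDs whose "sections" were empty reappear as all-NaN rows
ids = pd.json_normalize(sample)[["_id", "created_at"]]
flat = pd.json_normalize(sample, record_path="sections", meta=["_id", "created_at"], record_prefix="section_")
df2_joined = ids.merge(flat, on=["_id", "created_at"], how="left")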
If the sections key of a dict is an empty list, fill it with [{'answers': [{}]}] so the record survives each normalization step:
# Replace each empty "sections" list with one placeholder section holding a
# single empty answer, so json_normalize keeps a row for the record.
for i, d in enumerate(sample):
    if not d['sections']:
        sample[i]['sections'] = [{'answers': [{}]}]

df = pd.json_normalize(sample)
df2 = pd.json_normalize(df.to_dict(orient="records"), meta=["_id", "created_at"], record_path="sections", record_prefix="section_")
# display(df2)
section_comment section_type_fail section_answers section_score section_passed section__id section_custom_fields _id created_at
0 NaN [{'comment': 'stuff', 'feedback': [], 'value': 10.0, 'answer_type': 'default', 'question_id': '5e59599c68369c24069630fd', 'answer_id': '5e595a7c3fbb70448b6ff935'}, {'comment': 'stuff', 'feedback': [], 'value': 10.0, 'answer_type': 'default', 'question_id': '5e598939cedcaf5b865ef99a', 'answer_id': '5e598939cedcaf5b865ef998'}] 20.0 True 5e59599c68369c24069630fe [] 5f48bee4c54cf6b5e8048274 2020-08-28T08:23:00Z
1 NaN [{'comment': '', 'feedback': [], 'value': None, 'answer_type': 'not_applicable', 'question_id': '5e59894f68369c2398eb68a8', 'answer_id': '5eaad4e5b513aed9a3c996a5'}, {'comment': '', 'feedback': [], 'value': None, 'answer_type': 'not_applicable', 'question_id': '5e598967cedcaf5b865efe3e', 'answer_id': '5eaad4ece3f1e0794372f8b2'}, {'comment': 'stuff', 'feedback': [], 'value': 0.0, 'answer_type': 'default', 'question_id': '5e598976cedcaf5b865effd1', 'answer_id': '5e598976cedcaf5b865effd3'}] 0.0 True 5e59894f68369c2398eb68a9 [] 5f48bee4c54cf6b5e8048274 2020-08-28T08:23:00Z
2 NaN NaN [{}] NaN NaN NaN NaN 5f48f708fe22ca4d15fb3b55 2020-08-28T12:22:32Z
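Row 2 above is the placeholder at work: the otherwise-empty record now carries one section whose single empty answer will, in turn, produce exactly one all-NaN row in the next step.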
df3 = pd.json_normalize(df2.to_dict(orient="records"), meta=["_id", "created_at", "section__id", "section_score", "section_passed", "section_type_fail", "section_comment"], record_path="section_answers", record_prefix="")
# display(df3)
comment feedback value answer_type question_id answer_id _id created_at section__id section_score section_passed section_type_fail section_comment
0 stuff [] 10.0 default 5e59599c68369c24069630fd 5e595a7c3fbb70448b6ff935 5f48bee4c54cf6b5e8048274 2020-08-28T08:23:00Z 5e59599c68369c24069630fe 20 True NaN
1 stuff [] 10.0 default 5e598939cedcaf5b865ef99a 5e598939cedcaf5b865ef998 5f48bee4c54cf6b5e8048274 2020-08-28T08:23:00Z 5e59599c68369c24069630fe 20 True NaN
2 [] NaN not_applicable 5e59894f68369c2398eb68a8 5eaad4e5b513aed9a3c996a5 5f48bee4c54cf6b5e8048274 2020-08-28T08:23:00Z 5e59894f68369c2398eb68a9 0 True NaN
3 [] NaN not_applicable 5e598967cedcaf5b865efe3e 5eaad4ece3f1e0794372f8b2 5f48bee4c54cf6b5e8048274 2020-08-28T08:23:00Z 5e59894f68369c2398eb68a9 0 True NaN
4 stuff [] 0.0 default 5e598976cedcaf5b865effd1 5e598976cedcaf5b865effd3 5f48bee4c54cf6b5e8048274 2020-08-28T08:23:00Z 5e59894f68369c2398eb68a9 0 True NaN
5 NaN NaN NaN NaN NaN NaN 5f48f708fe22ca4d15fb3b55 2020-08-28T12:22:32Z NaN NaN NaN NaN NaN
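Since you're dealing with millions of records, it's cheap to verify at the end that nothing was dropped; a minimal check, assuming _id is the record key:

# sanity check: every source _id must appear in the final flattened frame
source_ids = {d['_id'] for d in sample}
missing = source_ids - set(df3['_id'])
assert not missing, f"IDs lost during flattening: {missing}"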