python pandas dataframe web-scraping json-normalize

Pandas's JSON_normalize for the nested dictionary

I have the following nested dictionary that contains the information that a person publicly reported to the organization.

`person_json={'basicInformation': {'individualId': 5429958, 
'firstName': 'BARBARA','middleName': 'JEAN','lastName': 'ABADI',
'sanctions': {'permanentBar': 'Y',
'sanctionDetails': [{'category': 'BAR',
 'regulator': 'FINRA',
 'messages': ['FINRA_BAR_ALL_MESSAGE'],
 'detail': [],
 'capacity': ['ALL']}]},
 'otherNames': [],
 'bcScope': 'InActive',
 'iaScope': 'NotInScope',
 'daysInIndustry': 3185},
 'currentEmployments': [],
 'currentIAEmployments': [],
 'previousEmployments': [{'iaOnly': 'N',
 'bdSECNumber': '44376',
 'firmId': 29705,
 'firmName': 'ACGM, INC.',
 'street1': '590 MADISON AVENUE, 41ST FLOOR, SUITE 4103',
 'city': 'NEW YORK',
 'state': 'NY',
 'zipCode': '10022',
 'registrationBeginDate': '6/1/2011',
 'registrationEndDate': '2/1/2017',
 'firmBCScope': 'INACTIVE',
 'firmIAScope': 'NOTINSCOPE'},
 {'iaOnly': 'N',
 'bdSECNumber': '49900',
 'firmId': 42619,
 'firmName': 'ACTIVA CAPITAL MARKETS, INC.',
 'street1': '590 MADISON AVENUE',
 'street2': '41ST FLOOR',
 'city': 'NEW YORK',
 'state': 'NY',
 'zipCode': '10022',
 'registrationBeginDate': '2/1/2008',
 'registrationEndDate': '2/16/2011',
 'firmBCScope': 'INACTIVE',
 'firmIAScope': 'NOTINSCOPE'}],
 'previousIAEmployments': [],
 'disclosureFlag': 'Y',
 'iaDisclosureFlag': 'N',
 'disclosures': [{'eventDate': '4/4/2017',
 'disclosureType': 'Regulatory',
 'disclosureResolution': 'Final',
 'isIapdExcludedCCFlag': 'N',
 'isBcExcludedCCFlag': 'N',
 'bcCtgryType': 11,
 'disclosureDetail': {'DocketNumberFDA': '2015044587502',
 'DocketNumberAAO': '2015044587502',
 'Initiated By': 'FINRA',
 'Allegations': 'Without admitting or denying the findings, Abadi consented to the 
 sanctions and to the entry of findings that she refused to produce documents and information and to appear for on-the-record testimony requested by FINRA in connection with its investigation concerning certain suspicious fund transfers involving her member firm and affiliates thereof indirectly owned and controlled by Abadi.',
'Resolution': 'Acceptance, Waiver & Consent(AWC)',
'SanctionDetails': [{'Sanctions': 'Bar (Permanent)',
  'SanctionDetails': [{'Registration Capacities Affected': 'All Capacities',
    'Duration': 'Indefinite',
    'Start Date': '4/4/2017'}]}]}}],
'examsCount': {'stateExamCount': 1,
'principalExamCount': 1,
'productExamCount': 2},
'stateExamCategory': [{'examCategory': 'Series 63',
'examName': 'Uniform Securities Agent State Law Examination',
'examTakenDate': '3/13/2008',
'examScope': 'BC'}],
'principalExamCategory': [{'examCategory': 'Series 24',
'examName': 'General Securities Principal Examination',
'examTakenDate': '9/26/2008',
'examScope': 'BC'}],
'productExamCategory': [{'examCategory': 'SIE',
'examName': 'Securities Industry Essentials Examination',
'examTakenDate': '2/1/2017',
'examScope': 'BC'},
{'examCategory': 'Series 7',
'examName': 'General Securities Representative Examination',
'examTakenDate': '1/31/2008',
'examScope': 'BC'}],
'registrationCount': {'approvedSRORegistrationCount': 0,
'approvedFinraRegistrationCount': 0,
'approvedStateRegistrationCount': 0},
'registeredStates': [],
'registeredSROs': [],
'brokerDetails': {'hasBCComments': 'N',
'hasIAComments': 'N',
'legacyReportStatusDescription': 'Not Requested'}}`

What I try to do is to tabulate the information in the above nested dictionary. This is the code I used. The table is named 'test1'.

`test1=pds.json_normalize(person_json)`

What I got from 'test1'is the following table. I captured the photo of some part of the table.

As you may see, there are still a list in "previousEmployments", "disclosures", "stateExamCategory", "principalExamCategory", "productExamCategory", and I am trying to flatten these data and joining them together in the table format.

The expected outcome should be a dataframe/a table that contain all information from all keys and all 'sub' keys of person_json.

How could I do this?

Thank you,

Solution

Assuming you ignore how should look like your final table/df, here is a proposition with flatten_json/json_normalize to make a DataFrame where the columns are the outer keys of person_json :

#pip install flatten_json
from flatten_json import flatten

tmp = pd.json_normalize(flatten(person_json))

df = (
    tmp.set_axis(tmp.columns.str.split("_", n=1, expand=True), axis=1)
            .stack(1).droplevel(0)
)

Output :

print(df)

                                             basicInformation  ... stateExamCategory
NaN                                                       NaN  ...               NaN
0_bcCtgryType                                             NaN  ...               NaN
0_bdSECNumber                                             NaN  ...               NaN
...                                                       ...  ...               ...
sanctions_sanctionDetails_0_messages_0  FINRA_BAR_ALL_MESSAGE  ...               NaN
sanctions_sanctionDetails_0_regulator                   FINRA  ...               NaN
stateExamCount                                            NaN  ...               NaN

[72 rows x 16 columns]

If you wanna retrieve a specific portion/slice, you can use something like :

query = df.filter(regex=".*ExamCount.*", axis=0).dropna(how="all", axis=1)
# query = df.filter(regex=".*ExamCount.*", axis=0).loc[:, "examsCount"] #variant

print(query) 

                    examsCount
principalExamCount        1.00
productExamCount          2.00
stateExamCount            1.00