Search code examples
pythonjsonpandasflattenjson-normalize

Complicated, nested json - how to best flatten using python


I have spent hours on stackoverflow, geeksforgeeks, etc. trying to find the best solution for my problem but am not landing on my desired outcome. I have a nested json that I am trying to flatten but I haven't had great luck with json_normalize(), flatten_json, etc.

Here's an example of one of my json values:

[{'id': '7064574404', 'type': 'INDIVIDUAL', 'name': {'first': 'John', 'middle': 'A.', 'last': 'Doe'}, 'addresses': [{'address': '774 Pony Ct', 'city': 'Aberdeen', 'state': 'IA', 'zip': '77445', 'phone': '8007777777'}, {'address': '776 S Adams St', 'city': 'Gray Mane', 'state': 'CA', 'zip': '22074', 'phone': '8882384677'}, {'address': '745 E Stallion Ave', 'city': 'White Mane', 'state': 'CA', 'zip': '22074', 'phone': '2234846627'}, {'address': '745 E Stallion Ave', 'city': 'White Mane', 'state': 'CA', 'zip': '22074', 'phone': '2234846627'}, {'address': '757 W Saint George Ave', 'city': 'Mustang', 'state': 'CA', 'zip': '22840', 'phone': '2234645662'}, {'address': '757 W Saint George Ave', 'city': 'Mustang', 'state': 'CA', 'zip': '22840', 'phone': '2234645662'}], 'specialty': ['Internal Medicine'], 'accepting': 'accepting', 'plans': [{'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740007', 'network_tier': 'NETWORK', 'years': [7077, 7074]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740007', 'network_tier': 'NETWORK', 'years': [7077]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740004', 'network_tier': 'NETWORK', 'years': [7077, 7074]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740004', 'network_tier': 'NETWORK', 'years': [7077, 7074]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740006', 'network_tier': 'NETWORK', 'years': [7077]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740007', 'network_tier': 'NETWORK', 'years': [7077, 7074]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740008', 'network_tier': 'NETWORK', 'years': [7077, 7074]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740009', 'network_tier': 'NETWORK', 'years': [7077]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740070', 'network_tier': 'NETWORK', 'years': [7077]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740077', 'network_tier': 'NETWORK', 'years': [7074]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740077', 'network_tier': 'NETWORK', 'years': [7074]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740074', 'network_tier': 'NETWORK', 'years': [7074]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740070', 'network_tier': 'NETWORK', 'years': [7077, 7074]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740077', 'network_tier': 'NETWORK', 'years': [7077, 7074]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740074', 'network_tier': 'NETWORK', 'years': [7077, 7074]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740075', 'network_tier': 'NETWORK', 'years': [7077, 7074]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740076', 'network_tier': 'NETWORK', 'years': [7077, 7074]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740077', 'network_tier': 'NETWORK', 'years': [7077, 7074]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740079', 'network_tier': 'NETWORK', 'years': [7077, 7074]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740040', 'network_tier': 'NETWORK', 'years': [7077]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740047', 'network_tier': 'NETWORK', 'years': [7077]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740040', 'network_tier': 'NETWORK', 'years': [7074]}, {'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '70774CA0740047', 'network_tier': 'NETWORK', 'years': [7074]}], 'languages': ['English'], 'gender': 'Male', 'last_updated_on': 2019-07-12'}

Ideally, I want the name (first, middle, last), addresses (address, city, state, zip, phone), and plans (plan id type, plan id, network tier, and years) broken out into new columns. Years do NOT need to be broken out. The end goal is a readable CSV. I can do addresses and plans separately with json_normalize, but not together.

df2 = pd.json_normalize(
    data, "addresses", ["id", "type", "specialty", "accepting", ["name", "first"], ["name", "middle"], ["name", "last"]]
)
df2

df3 = pd.json_normalize(
    data, "plans", ["id", "type", "specialty", "accepting", ["name", "first"], ["name", "middle"], ["name", "last"]]
)
df3

I am sure there is a simple answer that I'm just not uncovering. Thanks in advance!


Solution

  • Load the data

    df = pd.json_normalize(d)
    

    use explode to split the list into columns

    df.explode("addresses")
    

    now, your input will get divided into multiple rows based on addresses column

    sample:

    id  type    addresses   specialty   accepting   plans   languages   gender  last_updated_on name.first  name.middle name.last
    0   7064574404  INDIVIDUAL  {'address': '774 Pony Ct', 'city': 'Aberdeen',...   [Internal Medicine] accepting   [{'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '...   [English]   Male    2019-07-12  John    A.  Doe
    0   7064574404  INDIVIDUAL  {'address': '776 S Adams St', 'city': 'Gray Ma...   [Internal Medicine] accepting   [{'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '...   [English]   Male    2019-07-12  John    A.  Doe
    0   7064574404  INDIVIDUAL  {'address': '745 E Stallion Ave', 'city': 'Whi...   [Internal Medicine] accepting   [{'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '...   [English]   Male    2019-07-12  John    A.  Doe
    0   7064574404  INDIVIDUAL  {'address': '745 E Stallion Ave', 'city': 'Whi...   [Internal Medicine] accepting   [{'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '...   [English]   Male    2019-07-12  John    A.  Doe
    0   7064574404  INDIVIDUAL  {'address': '757 W Saint George Ave', 'city': ...   [Internal Medicine] accepting   [{'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '...   [English]   Male    2019-07-12  John    A.  Doe
    0   7064574404  INDIVIDUAL  {'address': '757 W Saint George Ave', 'city': ...   [Internal Medicine] accepting   [{'plan_id_type': 'HIOS-PLAN-ID', 'plan_id': '...   [English]   Male    2019-07-12  John    A.  Doe
    

    Further split by converting the dict to series

    df.addresses.apply(pd.Series)
    

    Sample output

    address city    state   zip phone
    0   774 Pony Ct Aberdeen    IA  77445   8007777777
    0   776 S Adams St  Gray Mane   CA  22074   8882384677
    0   745 E Stallion Ave  White Mane  CA  22074   2234846627
    0   745 E Stallion Ave  White Mane  CA  22074   2234846627
    0   757 W Saint George Ave  Mustang CA  22840   2234645662
    0   757 W Saint George Ave  Mustang CA  22840   2234645662
    

    You can continue the same approach for other columns and apply transformations based on your need.