Search code examples
pythonjsoncsvnormalize

Flatten JSON nested key value pairs for conversion to csv


I have a pretty typical json file, except a couple of items contain nested label/value pairs. Data sample:

{
   "total_count":10000,
   "count_from":1,
   "count_to":1000,
   "contacts":[
      {
         "contact_id":"ABC123",
         "contact_last_name":"Last name",
         "contact_email":[
            {
               "label":"primary",
               "value":"last.first@example.com"
            },
            {
               "label":"Secondary",
               "value":"first.last@example.com"
            },
            {
               "label":"personal",
               "value":"last.first@gmail.com"
            }
         ],
         "research_mail":"last.first@yahoo.com",
         "contact_phone":[
            {
               "label":"Desk/Work",
               "value":"2015555555"
            },
            {
               "label":"Mobile",
               "value":"2015555556"
            },
            {
               "label":"Other/Home",
               "value":"2015555557"
            }
         ],
         "contact_address1":"3rd street",
         "contact_asst_name":"",
         "contact_asst_phone":""
      }
   ]
}

I'm using pandas json_normalize to create a data set before finally creating the csv.

    import json
    import csv
    import pandas as pd

    with open("sourcefilename") as f:  
        data = json.load(f)
    info_df = pd.json_normalize(data, 'contacts')

The contact_email and research_email values each return as individual columns, with the raw JSON data for the column contents.

contact_id,contact_last_name,contact_email,research_mail, etc.

My desired output columns for that data is

contact_id,contact_last_name,contact_email_primary,contact_email_secondary,contact_email_personal,research_mail, etc.

Could someone suggest the best way to approach it? I would prefer to continue to use pandas and json_normalize if possible.


Solution

  • A simple approach there is to change the lists into dicts so that json_normalize can more easily apply its magic:

    with a quick re-formatter it could give

    def list2dic(inconvenient_list):
        reformatted_dic = {}
    
        for item in inconvenient_list:
            reformatted_dic[item['label']] = item['value']
    
        return reformatted_dic
    

    Apply it to the contact_email and contact_phone fields:

    for contact in data['contacts']:
        contact['contact_email'] = list2dict(contact['contact_email'])
        contact['contact_phone'] = list2dict(contact['contact_phone'])
    

    and now json_normalize should work just fine (changing the separator from '.' to '_')

    info_df = pd.json_normalize(data, 'contacts', sep='_')