Search code examples
pythonpandasnormalize

Pandas json_normalize to flatten a dictionary with values as columns


Here's a sample of the data I'm trying to flatten:

location = [{'id': 225,
  'country': 'US',
  'country_code': 'US',
  'country_population': 327167434,
  'province': '',
  'last_updated': '2020-05-06T11:33:46.184263Z',
  'coordinates': {'latitude': '37.0902', 'longitude': '-95.7129'},
  'latest': {'confirmed': 1204351, 'deaths': 71064, 'recovered': 0},
  'timelines': {'confirmed': {
                  'latest': 1204351,
                  'timeline': {
                     '2020-01-22T00:00:00Z': 1,
                     '2020-01-23T00:00:00Z': 1,
                     '2020-01-24T00:00:00Z': 2}
                    },
                'deaths': {
                  'latest': 71064,
                  'timeline': {
                     '2020-01-22T00:00:00Z': 0,
                     '2020-01-23T00:00:00Z': 0,
                     '2020-01-24T00:00:00Z': 0}
                     }
                    }
                    }]

I'm unable to get the timeline for confirmed & deaths in a usable format. I either get a column for each date with this:

pd.json_normalize(
    data = location,
    record_path=['timelines','confirmed','timeline'])

Or a row with date and no value for the count:

pd.json_normalize(data = location[0]['timelines']['confirmed']) 

Desired output is something like:

enter image description here

Thanks in advance; any assistance is much appreciated.


Solution

  • I'm interested in wrangling/reshaping with Pandas and wanted to try a solution via that method. It didn't make sense to duplicate the latest value with each timeline date row so I broke the data into 2 frames: latest & timeline. Still fairly new to this so any feedback/potential improvements are appreciated.

    df = pd.json_normalize(data = location, sep = '_')
    
    #list of current state / non-timeline columns
    latest_cols = ['id', 'country', 'country_code', 'country_population', 'province', 'last_updated',
                  'coordinates_latitude', 'coordinates_longitude', 'latest_confirmed', 'latest_deaths',
                  'latest_recovered', 'timelines_confirmed_latest', 'timelines_deaths_latest',
                  'timelines_recovered_latest']
    covid_latest = df[latest_cols]
    
    # remove id & country from latest cols for inclusion w/ timeline
    latest_cols = latest_cols[2:]
    covid_timelines = df.drop(columns = latest_cols)
    
    # reshape from wide to long
    covid_timelines = pd.melt(
        frame = covid_timelines,
        id_vars = ['id', 'country'],
        var_name = 'status_date',
        value_name = 'case_count')
    
    # split the timelines value and keep columns with date and type
    timeline_cols = covid_timelines.status_date.str.split('_', expand=True)
    timeline_cols.columns = ['timelines', 'case_type', 'timeline', 'str_date']
    timeline_cols = timeline_cols.drop(columns = ['timelines', 'timeline'])
    
    
    #add split cols back to covid_timelines
    covid_timelines = pd.concat([covid_timelines, timeline_cols], axis=1)
    
    # string to date and drop the dup date column
    covid_timelines.status_date = pd.to_datetime(timeline_cols.str_date).dt.date      
    covid_timelines = covid_timelines.drop(columns = 'str_date')
    
    #pivot death & confirmed from rows to cols
    covid_timelines.pivot_table(
        index = ['id', 'country', 'status_date'],
        columns = 'case_type',
        values = 'case_count' )
    

    enter image description here