
How to split a pandas column with a list of dicts into separate columns for each key


I'm analyzing the Political Advertisements from Facebook dataset released by ProPublica.

Here's what I mean. I have an entire column of targeting data that I want to analyze, but it's formatted in a way that's very inaccessible for someone at my skill level.

This is from just 1 cell: [{"target": "NAge", "segment": "21 and older"}, {"target": "MinAge", "segment": "21"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}]

and another: [{"target": "NAge", "segment": "18 and older"}, {"target": "Location Type", "segment": "HOME"}, {"target": "Interest", "segment": "Hispanic culture"}, {"target": "Interest", "segment": "Republican Party (United States)"}, {"target": "Location Granularity", "segment": "country"}, {"target": "Country", "segment": "the United States"}, {"target": "MinAge", "segment": 18}]

What I need is for every "target" value to become a column label, with each of its corresponding "segment" values as a possible value within that column.

Or, is the solution to create a function to call each dictionary key within each row to count frequency?


Solution

    • The column contains lists of dicts.
      • Each dict in a list can be moved to a separate row by using pandas.DataFrame.explode().
      • Convert the resulting column of dicts to a dataframe, where the keys become column headers and the values become observations, by using pandas.json_normalize(), then .join() this back to df.
    • Use .drop() to remove the now-unneeded column.
    • If the column contains lists of dicts that are strings (e.g. "[{key: value}]"), refer to this solution in Splitting dictionary/list inside a Pandas Column into Separate Columns, and use:
      • df.col2 = df.col2.apply(literal_eval), with from ast import literal_eval.
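A minimal sketch of that string-to-list conversion, using a made-up one-row series in place of the real column:

```python
from ast import literal_eval

import pandas as pd

# a column read from CSV holds strings, not Python lists
s = pd.Series(['[{"target": "NAge", "segment": "21 and older"}]'])

# literal_eval safely parses each string into a list of dicts
s = s.apply(literal_eval)

print(type(s.iloc[0]))  # each cell is now a real list
```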
    import pandas as pd
    
    # create sample dataframe
    df = pd.DataFrame({'col1': ['x', 'y'], 'col2': [[{"target": "NAge", "segment": "21 and older"}, {"target": "MinAge", "segment": "21"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}], [{"target": "NAge", "segment": "18 and older"}, {"target": "Location Type", "segment": "HOME"}, {"target": "Interest", "segment": "Hispanic culture"}, {"target": "Interest", "segment": "Republican Party (United States)"}, {"target": "Location Granularity", "segment": "country"}, {"target": "Country", "segment": "the United States"}, {"target": "MinAge", "segment": 18}]]})
    
    # display(df)
      col1                                                                                                                                                                                                                                                                                                                                                                                 col2
    0    x                                                                                                                                                   [{'target': 'NAge', 'segment': '21 and older'}, {'target': 'MinAge', 'segment': '21'}, {'target': 'Retargeting', 'segment': 'people who may be similar to their customers'}, {'target': 'Region', 'segment': 'the United States'}]
    1    y  [{'target': 'NAge', 'segment': '18 and older'}, {'target': 'Location Type', 'segment': 'HOME'}, {'target': 'Interest', 'segment': 'Hispanic culture'}, {'target': 'Interest', 'segment': 'Republican Party (United States)'}, {'target': 'Location Granularity', 'segment': 'country'}, {'target': 'Country', 'segment': 'the United States'}, {'target': 'MinAge', 'segment': 18}]
    
    # use explode to give each dict in a list a separate row
    df = df.explode('col2', ignore_index=True)
    
    # normalize the column of dicts, join back to the remaining dataframe columns, and drop the unneeded column
    df = df.join(pd.json_normalize(df.col2)).drop(columns=['col2'])
    

    display(df)

       col1                target                                       segment
    0     x                  NAge                                  21 and older
    1     x                MinAge                                            21
    2     x           Retargeting  people who may be similar to their customers
    3     x                Region                             the United States
    4     y                  NAge                                  18 and older
    5     y         Location Type                                          HOME
    6     y              Interest                              Hispanic culture
    7     y              Interest              Republican Party (United States)
    8     y  Location Granularity                                       country
    9     y               Country                             the United States
    10    y                MinAge                                            18
    
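If a wide layout (one column per target, as the question originally asked) is still wanted, the long result above can be pivoted. This is a sketch on a hand-built long-form frame; repeated targets such as 'Interest' are joined with '; ', which is an arbitrary choice:

```python
import pandas as pd

# long-form data, as produced by explode + json_normalize above
df = pd.DataFrame({
    'col1': ['x', 'x', 'y', 'y', 'y'],
    'target': ['NAge', 'MinAge', 'NAge', 'Interest', 'Interest'],
    'segment': ['21 and older', '21', '18 and older',
                'Hispanic culture', 'Republican Party (United States)']})

# pivot to one column per target; join repeated targets into one string
wide = (df.pivot_table(index='col1', columns='target', values='segment',
                       aggfunc=lambda s: '; '.join(map(str, s)))
          .reset_index())
```

Note the aggfunc is required because a plain pivot raises on duplicate (index, column) pairs like the two 'Interest' rows.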

    Get count

    • If the goal is to get the count for each 'target' and associated 'segment'
    counts = df.groupby(['target', 'segment']).count()
    
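A self-contained sketch of that count on toy data; .size() is used here instead of .count() so the result doesn't depend on which other columns remain in the frame:

```python
import pandas as pd

# toy long-form data with one repeated (target, segment) pair
df = pd.DataFrame({
    'col1': ['x', 'y', 'y', 'z'],
    'target': ['NAge', 'NAge', 'Interest', 'NAge'],
    'segment': ['21 and older', '18 and older',
                'Hispanic culture', '18 and older']})

# one row per unique (target, segment) pair with its frequency
counts = df.groupby(['target', 'segment']).size().reset_index(name='counts')
```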

    Updated

    • This update applies the solution to the full en-US.csv file.
    import pandas as pd
    from ast import literal_eval
    
    # load the file
    df = pd.read_csv('en-US.csv')
    
    # replace NaNs with '[]', otherwise literal_eval will error
    df.targets = df.targets.fillna('[]')
    
    # replace null with None, otherwise literal_eval will error
    df.targets = df.targets.str.replace('null', 'None')
    
    # convert the strings to lists of dicts
    df.targets = df.targets.apply(literal_eval)
    
    # use explode to give each dict in a list a separate row
    df = df.explode('targets', ignore_index=True)
    
    # fillna with {} is required for json_normalize
    df.targets = df.targets.fillna({i: {} for i in df.index})
    
    # normalize the column of dicts into a dataframe of target/segment columns
    normalized = pd.json_normalize(df.targets)
    
    # get the counts
    counts = normalized.groupby(['target', 'segment']).segment.count().reset_index(name='counts')
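The fillna step above is the non-obvious part: fillna rejects a bare {} as a fill value, so a dict mapping each index to an empty dict is passed instead. A small sketch of just that step, on a made-up series:

```python
import pandas as pd

# after explode, rows that had no targets hold missing values, not dicts
s = pd.Series([{'a': 1}, None, {'b': 2}])

# fillna accepts an index -> value mapping, which lets each
# missing row be replaced with an empty dict
s = s.fillna({i: {} for i in s.index})

# the empty dicts then pass cleanly through pd.json_normalize
```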