Search code examples
pythondataframejson-normalize

Create dataframe columns from JSON within CSV column


Good day all! I am trying to flatten some nested JSON using json_normalize, but I the output I keep getting is not what I need. Here's my code so far:

df1 = pd.read_csv('data_file.csv')

groups_dict = df1['groups']
df2 = pd.json_normalize(groups_dict)

The bit where the dictionary gets created seems to be working as seen here:

groups_dict.info()
groups_dict.head()

<class 'pandas.core.series.Series'>
RangeIndex: 19 entries, 0 to 18
Series name: groups
Non-Null Count  Dtype 
--------------  ----- 
19 non-null     object
dtypes: object(1)
memory usage: 280.0+ bytes

0   [{'group_id': 798800, 'name': 'Clickers 1 '}]
1   [{'group_id': 798803, 'name': 'Clickers 2'}]
2   [{'group_id': 848426, 'name': 'Colin Safe Brow...
3   [{'group_id': 798804, 'name': 'Clickers 3'}]
4   [{'group_id': 855348, 'name': 'Email Whitelist...
Name: groups, dtype: object

But when I try to normalize the dictionary, I get the following output:

df2 = pd.json_normalize(groups_dict)
df2.head()

0
1
2
3
4

I need to have each item from the groups column listed as it's own column to complete my project. Please see example below for sample data file (csv format) and what I am trying to accomplish.

CSV:

campaign_id,name,groups,status,content,duration_type,start_date,end_date,relative_duration,auto_enroll,allow_multiple_enrollments,completion_percentage
201644,Clicker 1 Retraining ,"[{'group_id': 798800, 'name': 'Clickers 1 '}]",Closed,"[{'store_purchase_id': 1076203, 'content_type': 'Store Purchase', 'name': 'Spot the Phish Game: Foundational', 'description': 'Make sure you can spot a phishing attempt by using this condensed Spot the Phish game. With ten...', 'type': 'Game', 'duration': 5, 'retired': False, 'retirement_date': None, 'publish_date': '2020-10-02T17:08:16.000Z', 'publisher': 'APP1', 'purchase_date': '2022-04-13T00:00:00.000Z', 'policy_url': None}]",Relative End Date,2022-04-19T08:00:00.000Z,,1 weeks,TRUE,FALSE,14
201645,Clicker 2 Retraining  ,"[{'group_id': 798803, 'name': 'Clickers 2'}]",In Progress,"[{'store_purchase_id': 1060139, 'content_type': 'Store Purchase', 'name': 'Micro-module – Social Engineering', 'description': 'This five-minute micro-module defines social engineering and describes what criminals are after....', 'type': 'Training Module', 'duration': 5, 'retired': False, 'retirement_date': None, 'publish_date': '2020-09-09T16:06:01.000Z', 'publisher': 'APP2', 'purchase_date': '2022-03-21T00:00:00.000Z', 'policy_url': None}]",Relative End Date,2022-04-13T08:00:00.000Z,,1 weeks,TRUE,FALSE,0

Before script:

df1['groups'].head()

0   [{'group_id': 798800, 'name': 'Clickers 1 '}]
1   [{'group_id': 798803, 'name': 'Clickers 2'}]
2   [{'group_id': 848426, 'name': 'Colin Safe Brow...
3   [{'group_id': 798804, 'name': 'Clickers 3'}]
4   [{'group_id': 855348, 'name': 'Email Whitelist...
Name: groups, dtype: object

After script:

df2.head()

   group_id   name
0  798800     Clickers 1
1  798803     Clickers 2
2  848426     Colin Safe Brow...
3  798804     Clickers 3
4  855348     Email Whitelist...

Anyone have pointers on how I should proceed?

Any assistance would be greatly appreaciated. Thanks!


Solution

  • You need to first extract the nested dict from its str representation by using eval or ast.literal_eval from the ast module.

    You can then create a separate dataframe from the column you want by doing:

    import ast
    
    df1['groups'] = df1['groups'].apply(ast.literal_eval)
    

    However, this returns a list of a single dict in your dataset. To combat this, we'll extract the first element of each row.

    df1['groups'] = df1['groups'].apply(lambda l: l[0])
    
    df2 = df1['groups'].apply(pd.Series)
    

    Then you can access individual columns such as group_id and name using:

    df2['group_id']
    df2['name']  # etc.   
    
       group_id
    0  798800
    1  798803
    2  848426
    3  798804
    4  855348
    

    Similarly for other columns within your nested dict.