Good day all! I am trying to flatten some nested JSON using json_normalize, but I the output I keep getting is not what I need. Here's my code so far:
df1 = pd.read_csv('data_file.csv')
groups_dict = df1['groups']
df2 = pd.json_normalize(groups_dict)
The bit where the dictionary gets created seems to be working as seen here:
groups_dict.info()
groups_dict.head()
<class 'pandas.core.series.Series'>
RangeIndex: 19 entries, 0 to 18
Series name: groups
Non-Null Count Dtype
-------------- -----
19 non-null object
dtypes: object(1)
memory usage: 280.0+ bytes
0 [{'group_id': 798800, 'name': 'Clickers 1 '}]
1 [{'group_id': 798803, 'name': 'Clickers 2'}]
2 [{'group_id': 848426, 'name': 'Colin Safe Brow...
3 [{'group_id': 798804, 'name': 'Clickers 3'}]
4 [{'group_id': 855348, 'name': 'Email Whitelist...
Name: groups, dtype: object
But when I try to normalize the dictionary, I get the following output:
df2 = pd.json_normalize(groups_dict)
df2.head()
0
1
2
3
4
I need to have each item from the groups column listed as it's own column to complete my project. Please see example below for sample data file (csv format) and what I am trying to accomplish.
CSV:
campaign_id,name,groups,status,content,duration_type,start_date,end_date,relative_duration,auto_enroll,allow_multiple_enrollments,completion_percentage
201644,Clicker 1 Retraining ,"[{'group_id': 798800, 'name': 'Clickers 1 '}]",Closed,"[{'store_purchase_id': 1076203, 'content_type': 'Store Purchase', 'name': 'Spot the Phish Game: Foundational', 'description': 'Make sure you can spot a phishing attempt by using this condensed Spot the Phish game. With ten...', 'type': 'Game', 'duration': 5, 'retired': False, 'retirement_date': None, 'publish_date': '2020-10-02T17:08:16.000Z', 'publisher': 'APP1', 'purchase_date': '2022-04-13T00:00:00.000Z', 'policy_url': None}]",Relative End Date,2022-04-19T08:00:00.000Z,,1 weeks,TRUE,FALSE,14
201645,Clicker 2 Retraining ,"[{'group_id': 798803, 'name': 'Clickers 2'}]",In Progress,"[{'store_purchase_id': 1060139, 'content_type': 'Store Purchase', 'name': 'Micro-module – Social Engineering', 'description': 'This five-minute micro-module defines social engineering and describes what criminals are after....', 'type': 'Training Module', 'duration': 5, 'retired': False, 'retirement_date': None, 'publish_date': '2020-09-09T16:06:01.000Z', 'publisher': 'APP2', 'purchase_date': '2022-03-21T00:00:00.000Z', 'policy_url': None}]",Relative End Date,2022-04-13T08:00:00.000Z,,1 weeks,TRUE,FALSE,0
Before script:
df1['groups'].head()
0 [{'group_id': 798800, 'name': 'Clickers 1 '}]
1 [{'group_id': 798803, 'name': 'Clickers 2'}]
2 [{'group_id': 848426, 'name': 'Colin Safe Brow...
3 [{'group_id': 798804, 'name': 'Clickers 3'}]
4 [{'group_id': 855348, 'name': 'Email Whitelist...
Name: groups, dtype: object
After script:
df2.head()
group_id name
0 798800 Clickers 1
1 798803 Clickers 2
2 848426 Colin Safe Brow...
3 798804 Clickers 3
4 855348 Email Whitelist...
Anyone have pointers on how I should proceed?
Any assistance would be greatly appreaciated. Thanks!
You need to first extract the nested dict from its str
representation by using eval
or ast.literal_eval
from the ast
module.
You can then create a separate dataframe from the column you want by doing:
import ast
df1['groups'] = df1['groups'].apply(ast.literal_eval)
However, this returns a list of a single dict in your dataset. To combat this, we'll extract the first element of each row.
df1['groups'] = df1['groups'].apply(lambda l: l[0])
df2 = df1['groups'].apply(pd.Series)
Then you can access individual columns such as group_id
and name
using:
df2['group_id']
df2['name'] # etc.
group_id
0 798800
1 798803
2 848426
3 798804
4 855348
Similarly for other columns within your nested dict.