I have a data frame like this:
      user_id          action  action_type            action_detail      device_type  secs_elapsed
0  d1mm9tcy42          lookup      Missing                  Missing  Windows Desktop           319
1  d1mm9tcy42  search_results        click      view_search_results  Windows Desktop         67753
2  d1mm9tcy42          lookup      Missing                  Missing  Windows Desktop           301
3  d1mm9tcy42  search_results        click      view_search_results  Windows Desktop         22141
4  d1mm9tcy42          lookup      Missing                  Missing  Windows Desktop           435
5  d1mm9tcy42  search_results        click      view_search_results  Windows Desktop          7703
6  d1mm9tcy42          lookup      Missing                  Missing  Windows Desktop           115
7  d1mm9tcy42     personalize         data  wishlist_content_update  Windows Desktop           831
8  d1mm9tcy42           index         view      view_search_results  Windows Desktop         20842
9  d1mm9tcy42          lookup      Missing                  Missing  Windows Desktop           683
I want to set up a bar chart with the categorical columns (e.g. action, action_type and action_detail) on the x axis and, on the y axis, the percentage of rows in each column whose value is Missing, Unknown (you can't see this here, but some columns do have that value) or Other (anything that is not Missing or Unknown).
I am also struggling to see, for each value in the action column, what percentage of the corresponding action_type and action_detail values are Missing or Unknown. E.g. if the action lookup occurs 100 times, 20% of those rows might have a Missing action_type, etc.
I have got somewhere with this via this type of code:
print("The percentage of missing action types is {0}".format(
    (clean_sessions['action_type'] == 'Missing').value_counts()
    / clean_sessions['action_type'].count()))
But I want to bring my analysis to the next level.
Reduce each column to the three categories ('Missing', 'Unknown', 'Other'), then call .value_counts on each column. You get nan instead of 0 when a value is not in a column, so you might want to use fillna(0) at the end.
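cols = ['action', 'action_type', 'action_detail']
result = (df[cols]
          # keep 'Missing' and 'Unknown'; relabel every other value as 'Other'
          .where(df[cols].isin(['Missing', 'Unknown']), 'Other')
          # fraction of rows per label, computed column by column
          .apply(lambda x: x.value_counts(normalize=True))
          .fillna(0))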
print(result)
         action  action_type  action_detail
Missing       0          0.5            0.5
Other         1          0.5            0.5
result.T.plot(kind='bar', stacked=True)
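
For the per-action breakdown asked about in the question (what share of action_type and action_detail is Missing or Unknown for each action), one possible approach is to flag those values and average the flags within each action group. This is only a sketch: it rebuilds the ten sample rows from the question, and the names cols, flagged and per_action are made up for illustration.

```python
import pandas as pd

# Sample rows copied from the question (only the columns needed here).
df = pd.DataFrame({
    'action': ['lookup', 'search_results', 'lookup', 'search_results', 'lookup',
               'search_results', 'lookup', 'personalize', 'index', 'lookup'],
    'action_type': ['Missing', 'click', 'Missing', 'click', 'Missing',
                    'click', 'Missing', 'data', 'view', 'Missing'],
    'action_detail': ['Missing', 'view_search_results', 'Missing',
                      'view_search_results', 'Missing', 'view_search_results',
                      'Missing', 'wishlist_content_update',
                      'view_search_results', 'Missing'],
})

cols = ['action_type', 'action_detail']

# True where the value is 'Missing' or 'Unknown'. The mean of a boolean
# column is the fraction of True values in it.
flagged = df[cols].isin(['Missing', 'Unknown'])

# Average the flags within each action and scale to percentages.
per_action = flagged.groupby(df['action']).mean() * 100
print(per_action)
```

On the sample rows, lookup comes out at 100% for both columns and every other action at 0%. If that shape is what you are after, per_action.plot(kind='bar') should give one group of bars per action.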