I have a Kickstarter Dataset and want to show the success rate in a plot. So there is a Column "Successful" which is either "Yes" or "No" and the different countries.
Succesfull=df[['country', 'succesfull']].groupby(['country', 'succesfull']).size()
with this line I created a new data frame which shows how many projects in a country were successful and how many not. It looks like that:
How can I now calculate the success rate ( Amount of "Yes" / Total Amount) of each country? how can I access the last column which doesn't have a name (there is just a 0).
This may not be the best approach, but this is how I split up the problem.
df = pd.DataFrame({'country': ['AU', 'AU', 'CA', 'CA', 'GB', 'GB'], 'successfull': ['Yes', 'No', 'No', 'No', 'Yes', 'Yes']})
df
Out[3]:
country successfull
0 AU Yes
1 AU No
2 CA No
3 CA No
4 GB Yes
5 GB Yes
Matching up only the values with 'Yes' and counting them
success = df[df['successfull'] == 'Yes'].groupby('country').size()
success
Out[5]:
country
AU 1
GB 2
dtype: int64
In a similar fashion to the question posted.
total = df.groupby('country').size()
total
Out[7]:
country
AU 2
CA 2
GB 2
dtype: int64
Once these two series exist the ratio can be taken. If a country does not have a 'Yes', it will become NaN. Replace these with 0, because the success rate is 0 here.
Successfull = (success/total).replace(np.nan, 0)
Out[10]:
country
AU 0.5
CA 0.0
GB 1.0
dtype: float64
The Successfull object above is a Pandas.Series
. Below are a few examples for accessing. For example, to know how successful a country is, use the country code as the index:
Successfull['AU']
Out[11]: 0.5
To get an array of success ratios the .values
attribute can be accessed.
Successfull.values
Out[12]: array([0.5, 0. , 1. ])
To plot the results
Successfull.plot()