Search code examples
pythongroup-bysize

Python: how to use result of groupby().size()


I have a Kickstarter Dataset and want to show the success rate in a plot. So there is a Column "Successful" which is either "Yes" or "No" and the different countries.

Succesfull=df[['country', 'succesfull']].groupby(['country', 'succesfull']).size()

with this line I created a new data frame which shows how many projects in a country were successful and how many not. It looks like that:

enter image description here

How can I now calculate the success rate ( Amount of "Yes" / Total Amount) of each country? how can I access the last column which doesn't have a name (there is just a 0).


Solution

  • This may not be the best approach, but this is how I split up the problem.

    1) make a small reproducible example to test different cases

    df = pd.DataFrame({'country': ['AU', 'AU', 'CA', 'CA', 'GB', 'GB'], 'successfull': ['Yes', 'No', 'No', 'No', 'Yes', 'Yes']})
    
    df
    Out[3]: 
      country successfull
    0      AU         Yes
    1      AU          No
    2      CA          No
    3      CA          No
    4      GB         Yes
    5      GB         Yes
    

    2) counting successful projects

    Matching up only the values with 'Yes' and counting them

    success = df[df['successfull'] == 'Yes'].groupby('country').size()
    
    success
    Out[5]: 
    country
    AU    1
    GB    2
    dtype: int64
    

    3) counting total projects

    In a similar fashion to the question posted.

    total = df.groupby('country').size()
    
    total
    Out[7]: 
    country
    AU    2
    CA    2
    GB    2
    dtype: int64
    

    4) Take the ratio of successful projects to total projects

    Once these two series exist the ratio can be taken. If a country does not have a 'Yes', it will become NaN. Replace these with 0, because the success rate is 0 here.

    Successfull = (success/total).replace(np.nan, 0)
    Out[10]: 
    country
    AU    0.5
    CA    0.0
    GB    1.0
    dtype: float64
    

    5) accessing the data

    The Successfull object above is a Pandas.Series. Below are a few examples for accessing. For example, to know how successful a country is, use the country code as the index:

    Successfull['AU']
    Out[11]: 0.5
    

    To get an array of success ratios the .values attribute can be accessed.

    Successfull.values
    Out[12]: array([0.5, 0. , 1. ])
    

    To plot the results

    Successfull.plot()