Tags: python, pandas, pearson-correlation

How to calculate the correlation coefficient of grouped quantities in Pandas?


I have a DataFrame in which each row represents a traffic accident. Two of the columns are Speed_limit and Number_of_casualties. I would like to compute the Pearson correlation coefficient between the speed limit and the ratio of the number of casualties to accidents for each speed limit.

My solution so far is to get the relevant quantities as arrays and use SciPy's pearsonr:

import pandas as pd
import scipy.stats

df = pd.DataFrame({'Speed_limit': [10, 10, 20, 20, 20, 30],
                   'Number_of_casualties': [1, 2, 3, 4, 1, 4]})

accidents_per_speed_limit = df['Speed_limit'].value_counts().sort_index()

number_of_casualties_per_speed_limit = df.groupby('Speed_limit')['Number_of_casualties'].sum()

speed_limit = accidents_per_speed_limit.index
ratio = number_of_casualties_per_speed_limit.values / accidents_per_speed_limit.values

r, _ = scipy.stats.pearsonr(x=speed_limit, y=ratio)

print("The Pearson correlation coefficient between the number of casualties per accident and the speed limit is {r}.".format(r=r))

However, it would seem to me that it should be possible to do this more elegantly using the pandas.DataFrame.corr method. How could I refactor this code to make it more pandas-like?


Solution

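  • Instead of computing the count and the sum separately, you can take the mean of the grouped data directly, because the sum of casualties divided by the number of accidents in each group is the group mean. Then use Series.corr (the default method is Pearson), i.e.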

    m = df.groupby('Speed_limit').mean().reset_index()
    m['Speed_limit'].corr(m['Number_of_casualties'])
    

    Output:

    0.99926008128973687
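
As a quick sanity check, the sketch below uses the sample data from the question to confirm that the group mean equals sum / count, and that Series.corr, DataFrame.corr and NumPy's corrcoef agree on the coefficient. The variable names (grouped, ratio_manual, r_series and so on) are illustrative only and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Speed_limit': [10, 10, 20, 20, 20, 30],
                   'Number_of_casualties': [1, 2, 3, 4, 1, 4]})

# Casualties per accident for each speed limit, computed two ways.
grouped = df.groupby('Speed_limit')['Number_of_casualties']
ratio_manual = grouped.sum() / grouped.count()  # explicit sum / count
ratio_mean = grouped.mean()                     # the same quantity in one step

# Turn the group index back into a column so both quantities are columns.
m = ratio_mean.reset_index()

# Pearson correlation via Series.corr (method='pearson' is the default).
r_series = m['Speed_limit'].corr(m['Number_of_casualties'])

# The same value read from the correlation matrix of DataFrame.corr.
r_frame = m.corr().loc['Speed_limit', 'Number_of_casualties']

# Cross-check with NumPy, independent of pandas.
r_numpy = np.corrcoef(m['Speed_limit'], m['Number_of_casualties'])[0, 1]

print(r_series, r_frame, r_numpy)
```

DataFrame.corr returns the full pairwise correlation matrix, so it is mostly useful when there are more than two columns to compare; for a single pair, Series.corr is the more direct call.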