I have a DataFrame in which each row represents a traffic accident. Two of the columns are Speed_limit and Number_of_casualties. I would like to compute the Pearson correlation coefficient between the speed limit and the ratio of the number of casualties to the number of accidents at each speed limit.
My solution so far is to get the relevant quantities as arrays and use SciPy's pearsonr:
import pandas as pd
import scipy.stats
df = pd.DataFrame({'Speed_limit': [10, 10, 20, 20, 20, 30],
                   'Number_of_casualties': [1, 2, 3, 4, 1, 4]})
accidents_per_speed_limit = df['Speed_limit'].value_counts().sort_index()
number_of_casualties_per_speed_limit = df.groupby('Speed_limit').sum()['Number_of_casualties']
speed_limit = accidents_per_speed_limit.index
ratio = number_of_casualties_per_speed_limit.values / accidents_per_speed_limit.values
r, _ = scipy.stats.pearsonr(x=speed_limit, y=ratio)
print("The Pearson's correlation coefficient between the number of casualties per accidents and the speed limit is {r}.".format(r=r))
However, it seems to me that this should be possible to do more elegantly using the pandas.DataFrame.corr method. How could I refactor this code to make it more pandas-like?
Instead of count and sum, you can use mean directly on the grouped data, then use Series.corr (the default method is Pearson), i.e.:
m = df.groupby('Speed_limit').mean().reset_index()
m['Speed_limit'].corr(m['Number_of_casualties'])
Output:
0.99926008128973687
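As a sanity check, here is a short sketch (using the sample data from the question) confirming that the groupby-mean approach gives the same coefficient as the original count/sum ratio, since the mean of casualties per group is exactly the sum divided by the count:

```python
import pandas as pd
import scipy.stats

df = pd.DataFrame({'Speed_limit': [10, 10, 20, 20, 20, 30],
                   'Number_of_casualties': [1, 2, 3, 4, 1, 4]})

# Original approach: sum of casualties divided by accident count per speed limit
counts = df['Speed_limit'].value_counts().sort_index()
sums = df.groupby('Speed_limit')['Number_of_casualties'].sum()
r_manual, _ = scipy.stats.pearsonr(x=counts.index, y=sums.values / counts.values)

# Suggested approach: groupby mean, then Series.corr (Pearson by default)
m = df.groupby('Speed_limit', as_index=False).mean()
r_pandas = m['Speed_limit'].corr(m['Number_of_casualties'])

print(r_manual, r_pandas)  # both ≈ 0.99926
```

Note that `as_index=False` is equivalent to the `.reset_index()` call in the answer; either form keeps Speed_limit as a regular column so it can be passed to corr.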