I sorted the values, but the problem is the 'до 25' (up to 25) category. How can I change it into '0-25' and calculate the correlation coefficient between age group and overall rating?
Some of my data is below:
Age group | Overall rating |
---|---|
65 and older | 38.45 |
55-64 | 17.66 |
up to 25 | 46.56 |
45-54 | 24.95 |
35-44 | 33.54 |
25-34 | 37.21 |
Below is how you can do what you ask. I converted your age categories to a mean age because Pearson correlation needs two numeric variables; a category label will not work. There are a couple of other problems with your data. It is unclear what the '65 and older' class covers numerically. I made it 65-100, but that is an assumption and may not match your data. Your categories are also labelled as 25-34, for example. If you mean a half-open interval (lower bound included, upper bound excluded), the label would be 25-35, which covers 25, 26, 27, 28, 29, 30, 31, 32, 33 and 34 but not 35. I did not change the labels, but you should change them if that is the convention you want.
import pandas as pd
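from scipy.stats import pearsonr
from IPython.display import display  # built in to Jupyter; imported explicitly so the script also runs outside a notebook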
import warnings
warnings.filterwarnings("ignore")
Agelst=['65 and older','55-64','up to 25','45-54','35-44','25-34']
Ratelst=[38.45,17.66,46.56,24.95,33.54,37.21]
df=pd.DataFrame()
df['Age_Group']=Agelst
df['Overal_Rating']=Ratelst
display(df)
#Change 'up to 25' to '0-25'
df.replace('up to 25', '0-25',inplace=True)
df.replace('65 and older', '65-100',inplace=True)
display(df)
#You will need a numeric age to use for correlation. We can develop one from the strings in your 'Age_Group'
loweragelst=[]
upperagelst=[]
for i in range(len(df)):
    # Each label is now 'lower-upper'; split it and keep both bounds as integers
    loweragelst.append(int(((df.iloc[i]['Age_Group']).split('-'))[0]))
    upperagelst.append(int(((df.iloc[i]['Age_Group']).split('-'))[1]))
df['Lower_Age']=loweragelst
df['Upper_Age']=upperagelst
#Sort the df
df.sort_values(by=['Lower_Age'], ascending=True,inplace=True)
display(df)
#Add a mean age column to use for correlation
df['Mean_Age']=(df['Lower_Age']+df['Upper_Age'])/2
display(df)
#Calculate Pearson's Correlation
X=df['Mean_Age']
Y=df['Overal_Rating']
PCor= pearsonr(X, Y)
print(PCor)
The resulting df and correlation are:
Age_Group Overal_Rating
0 65 and older 38.45
1 55-64 17.66
2 up to 25 46.56
3 45-54 24.95
4 35-44 33.54
5 25-34 37.21
Age_Group Overal_Rating
0 65-100 38.45
1 55-64 17.66
2 0-25 46.56
3 45-54 24.95
4 35-44 33.54
5 25-34 37.21
Age_Group Overal_Rating Lower_Age Upper_Age
2 0-25 46.56 0 25
5 25-34 37.21 25 34
4 35-44 33.54 35 44
3 45-54 24.95 45 54
1 55-64 17.66 55 64
0 65-100 38.45 65 100
Age_Group Overal_Rating Lower_Age Upper_Age Mean_Age
2 0-25 46.56 0 25 12.5
5 25-34 37.21 25 34 29.5
4 35-44 33.54 35 44 39.5
3 45-54 24.95 45 54 49.5
1 55-64 17.66 55 64 59.5
0 65-100 38.45 65 100 82.5
PearsonRResult(statistic=-0.4489402583278369, pvalue=0.37183097344063043)
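Since the upper bound for '65 and older' is a guess, it may be worth checking how much the coefficient depends on it. Here is a minimal sketch in plain Python (no pandas or SciPy) that recomputes Pearson's r from the group midpoints for a few caps. The names `age_bounds`, `pearson`, `r_for_cap` and the `open_upper` parameter are made up for illustration, and the caps 80 and 90 are arbitrary alternatives to the 100 used above.

```python
import math

def age_bounds(label, open_upper=100):
    """Return (lower, upper) for labels like '25-34', 'up to 25', '65 and older'.

    open_upper is an ASSUMED cap for the open-ended oldest group.
    """
    label = label.strip()
    if label.startswith("up to "):
        return 0, int(label[len("up to "):])
    if label.endswith(" and older"):
        return int(label[:-len(" and older")]), open_upper
    lower, upper = label.split("-")
    return int(lower), int(upper)

def pearson(xs, ys):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

groups = ['65 and older', '55-64', 'up to 25', '45-54', '35-44', '25-34']
ratings = [38.45, 17.66, 46.56, 24.95, 33.54, 37.21]

def r_for_cap(open_upper):
    """Pearson's r between group midpoints and ratings for a given assumed cap."""
    mids = [sum(age_bounds(g, open_upper)) / 2 for g in groups]
    return pearson(mids, ratings)

# Compare the coefficient under a few assumed caps for the oldest group
for cap in (80, 90, 100):
    print(cap, round(r_for_cap(cap), 4))
```

With a cap of 100 this should reproduce the `pearsonr` statistic shown above. If r shifts noticeably between caps, the coefficient is being driven in part by the assumed cap rather than by the data, which is worth keeping in mind with only six groups.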