python sorting machine-learning heatmap correlation

Age range to numerical values to calculate correlation of CD consumption with age range


I sorted the values, but the problem is the label 'до 25' (up to 25). How can I change it to '0-25' and calculate the correlation coefficient between age group and overall rating?

Some of my data is below:

Age group      Overall rating
65 and older   38.45
55-64          17.66
up to 25       46.56
45-54          24.95
35-44          33.54
25-34          37.21

Solution

  • Below is how you can do what you ask. I converted your age categories to a mean age because Pearson correlation needs two numeric variables; a categorical label will not work. There are some other problems with your data. It is unclear what the '65 and older' class covers numerically. I made it 65-100, but that is an assumption and may not match your data. Your categories are also written as closed ranges, for example 25-34. If you mean the interval that contains 25, 26, 27, 28, 29, 30, 31, 32, 33 and 34, it is often written as 25-35 with the upper bound excluded, so that 35 itself is not part of the class. I did not change this, but you should if that is the convention you want.

    import pandas as pd
    from scipy.stats import pearsonr
    from IPython.display import display  # display() is predefined only inside Jupyter/IPython
    import warnings
    warnings.filterwarnings("ignore")
    
    Agelst=['65 and older','55-64','up to 25','45-54','35-44','25-34']
    Ratelst=[38.45,17.66,46.56,24.95,33.54,37.21]
    
    df=pd.DataFrame()
    df['Age_Group']=Agelst
    df['Overal_Rating']=Ratelst
    
    display(df)
    
    #Change 'up to 25' to '0-25' and '65 and older' to '65-100' (the upper bound of 100 is an assumption)
    df.replace('up to 25', '0-25',inplace=True)
    df.replace('65 and older', '65-100',inplace=True)
    
    display(df)
    
    #You will need a numeric age to use for correlation.  We can develop one from the strings in your 'Age_Group'
    #Split each 'lower-upper' label into two integer columns
    bounds=df['Age_Group'].str.split('-', expand=True).astype(int)
    df['Lower_Age']=bounds[0]
    df['Upper_Age']=bounds[1]
    
    #Sort the df
    df.sort_values(by=['Lower_Age'], ascending=True,inplace=True)
    display(df)
    
    #Add a mean age column to use for correlation
    df['Mean_Age']=(df['Lower_Age']+df['Upper_Age'])/2
    
    display(df)
    
    #Calculate Pearson's Correlation
    X=df['Mean_Age']
    Y=df['Overal_Rating']
    PCor= pearsonr(X, Y)
    print(PCor)
    

    The resulting df and correlation are:

          Age_Group  Overal_Rating
    0  65 and older          38.45
    1         55-64          17.66
    2      up to 25          46.56
    3         45-54          24.95
    4         35-44          33.54
    5         25-34          37.21

       Age_Group  Overal_Rating
    0     65-100          38.45
    1      55-64          17.66
    2       0-25          46.56
    3      45-54          24.95
    4      35-44          33.54
    5      25-34          37.21

       Age_Group  Overal_Rating  Lower_Age  Upper_Age
    2       0-25          46.56          0         25
    5      25-34          37.21         25         34
    4      35-44          33.54         35         44
    3      45-54          24.95         45         54
    1      55-64          17.66         55         64
    0     65-100          38.45         65        100

       Age_Group  Overal_Rating  Lower_Age  Upper_Age  Mean_Age
    2       0-25          46.56          0         25      12.5
    5      25-34          37.21         25         34      29.5
    4      35-44          33.54         35         44      39.5
    3      45-54          24.95         45         54      49.5
    1      55-64          17.66         55         64      59.5
    0     65-100          38.45         65        100      82.5
    
    PearsonRResult(statistic=-0.4489402583278369, pvalue=0.37183097344063043)
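
Since 100 is an arbitrary upper bound for '65 and older', it can be worth checking how much the result depends on it. The following is only a minimal sketch and not part of the original answer. The names `age_bounds`, `mean_ages` and `open_upper` are hypothetical. The parsing rule is an assumption about this data: a single-number label that starts with a digit is treated as open-ended above, and any other single-number label (such as 'up to 25' or 'до 25') as "0 to N". The sketch also computes Spearman's rank correlation with `scipy.stats.spearmanr`, which uses only the order of the age groups and so does not depend on the assumed cap.

```python
import re

import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Same sample data as in the question.
df = pd.DataFrame({
    "Age_Group": ["65 and older", "55-64", "up to 25", "45-54", "35-44", "25-34"],
    "Overal_Rating": [38.45, 17.66, 46.56, 24.95, 33.54, 37.21],
})


def age_bounds(label, open_upper=100):
    """Return (lower, upper) for an age label.

    open_upper is an ASSUMED cap for the open-ended '65 and older' class.
    """
    numbers = [int(n) for n in re.findall(r"\d+", label)]
    if len(numbers) == 2:                # e.g. '25-34'
        return numbers[0], numbers[1]
    if label.strip()[0].isdigit():       # e.g. '65 and older' (assumed rule)
        return numbers[0], open_upper
    return 0, numbers[0]                 # e.g. 'up to 25' or 'до 25'


def mean_ages(labels, open_upper=100):
    """Midpoint of each age class, used as a numeric stand-in for the class."""
    return [sum(age_bounds(label, open_upper)) / 2 for label in labels]


# Pearson's r for several assumed caps on the oldest class.
pearson_by_cap = {}
for cap in (80, 90, 100):
    r, p = pearsonr(mean_ages(df["Age_Group"], cap), df["Overal_Rating"])
    pearson_by_cap[cap] = r
    print(f"open_upper={cap}: Pearson r={r:.3f}, p={p:.3f}")

# Spearman's rho depends only on the ordering of the groups, not on the cap.
rho, p_rho = spearmanr(mean_ages(df["Age_Group"]), df["Overal_Rating"])
print(f"Spearman rho={rho:.3f}, p={p_rho:.3f}")
```

With `open_upper=100` the Pearson value should reproduce the statistic shown above, since the midpoints are the same. With only six groups, neither coefficient says much about statistical significance.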