python pandas scipy pearson-correlation

Correlation using scipy


I have two variables, polarity and sentiment, and I would like to check whether they are correlated. polarity takes continuous values from 0 to 1; sentiment takes the values -1, 0 and 1. I tried the following:

from scipy import stats

pearson_coef, p_value = stats.pearsonr(df['polarity'], df['sentiment']) 
print(pearson_coef)

but I got the following error:

TypeError: unsupported operand type(s) for +: 'float' and 'str'

Example of values:

polarity      sentiment
 
0.34            -1
0.12            -1
0.85             1
0.76             1
0.5              0
0.21             0

Solution

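  • Since you are working with a DataFrame, first check the dtypes of the columns. Here sentiment is stored as object (strings) rather than as a number, which is what causes the TypeError; converting it to float fixes the problem: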

    >>> df.info() 
    
     #   Column     Non-Null Count  Dtype  
    ---  ------     --------------  -----  
     0   polarity   6 non-null      float64
     1   sentiment  6 non-null      object 
    
    >>> df['sentiment'] = df.sentiment.map(float) # or do : df = df.astype(float)
    
    >>> df.info()
    
     #   Column     Non-Null Count  Dtype  
    ---  ------     --------------  -----  
     0   polarity   6 non-null      float64
     1   sentiment  6 non-null      float64
    
    
    >>> pearson_coef, p_value = stats.pearsonr(df['polarity'], df['sentiment']) 
    >>> print(pearson_coef)
    0.870679269711991
    
    # pandas can also compute the Pearson correlation matrix directly (Pearson is the default method):
    >>> df.corr()
    
               polarity  sentiment
    polarity   1.000000   0.870679
    sentiment  0.870679   1.000000
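
If the real column might contain values that cannot be parsed as numbers, map(float) will raise. Below is a minimal, self-contained sketch of a more forgiving variant of the same steps. It assumes the sentiment values arrive as strings (my guess at why the dtype is object) and uses pd.to_numeric with errors="coerce" so that unparseable entries become NaN; the sample data is copied from the question, and the name clean is just an illustrative variable.

```python
import pandas as pd
from scipy import stats

# Sample values from the question. Sentiment is stored as strings on purpose
# (an assumption) to reproduce the object dtype seen in df.info().
df = pd.DataFrame({
    "polarity": [0.34, 0.12, 0.85, 0.76, 0.5, 0.21],
    "sentiment": ["-1", "-1", "1", "1", "0", "0"],
})

# errors="coerce" turns anything unparseable into NaN instead of raising.
df["sentiment"] = pd.to_numeric(df["sentiment"], errors="coerce")

# pearsonr does not skip NaN values, so drop incomplete rows first.
clean = df.dropna(subset=["polarity", "sentiment"])

pearson_coef, p_value = stats.pearsonr(clean["polarity"], clean["sentiment"])
print(round(pearson_coef, 6))  # 0.870679
```

With the six rows above nothing is dropped, so the coefficient matches the one shown earlier. On real data it is worth comparing len(clean) with len(df) to see how many rows were discarded by the conversion.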