I have two variables, one called polarity and another called sentiment. I would like to check whether there is a correlation between these two variables.
polarity can take values from 0 to 1 (continuous); sentiment can take the values -1, 0 and 1.
I have tried as follows:
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['polarity'], df['sentiment'])
print(pearson_coef)
but I got the following error:
TypeError: unsupported operand type(s) for +: 'float' and 'str'
Example of values:
polarity sentiment
0.34 -1
0.12 -1
0.85 1
0.76 1
0.5 0
0.21 0
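Since you are dealing with a DataFrame, you can call df.info() to check the dtypes of the columns. Here sentiment has dtype object, which usually means the values are stored as strings; that is consistent with the 'float' and 'str' in the error message. Converting the column to float fixes the problem: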
>>> df.info()
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   polarity   6 non-null      float64
 1   sentiment  6 non-null      object
>>> df['sentiment'] = df.sentiment.map(float) # or do : df = df.astype(float)
>>> df.info()
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   polarity   6 non-null      float64
 1   sentiment  6 non-null      float64
>>> pearson_coef, p_value = stats.pearsonr(df['polarity'], df['sentiment'])
>>> print(pearson_coef)
0.870679269711991
# pandas can also compute the correlation matrix directly; df.corr() uses the Pearson method by default:
>>> df.corr()
           polarity  sentiment
polarity   1.000000   0.870679
sentiment  0.870679   1.000000
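If the column might contain entries that cannot be parsed as numbers, map(float) would raise an error. A minimal sketch of a more tolerant conversion is shown here, assuming the six sample rows from the question and assuming sentiment was loaded as strings (the question does not show how df was built, so this construction is hypothetical):

```python
import pandas as pd
from scipy import stats

# Sample values from the question; sentiment is deliberately stored as
# strings here to reproduce the non-numeric dtype (an assumption).
df = pd.DataFrame({
    "polarity": [0.34, 0.12, 0.85, 0.76, 0.5, 0.21],
    "sentiment": ["-1", "-1", "1", "1", "0", "0"],
})

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
df["sentiment"] = pd.to_numeric(df["sentiment"], errors="coerce")

# Drop rows where either value is missing before computing the correlation.
clean = df.dropna(subset=["polarity", "sentiment"])

pearson_coef, p_value = stats.pearsonr(clean["polarity"], clean["sentiment"])
print(pearson_coef, p_value)
```

On these six rows the coefficient should match the 0.870679 shown above. With real data it is worth checking how many rows dropna removed, since a large number would suggest a problem with how the column was read in.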