I have a dataset with labels and usernames:
Labels Usernames
1 Londonderry
1 Londoncalling
1 Steveonder43
0 Maryclare_re
1 Patent107391
0 Anonymous
1 _24londonqr
...
I need to show that there is a correlation between usernames containing the word London and label 1. To do this, I created a second column that flags the usernames where the word London appears:
for idx, username in df['Usernames']:
    if 'London' in username:
        df['London'].iloc[idx] = 1
    else:
        df['London'].iloc[idx] = 0
Then I compared the two binary variables using the Pearson correlation coefficient:
import scipy.stats.pearsonr as rho
corr = rho(df['labels'], df['London'])
However, it does not work. Am I missing something in the steps above?
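Your DataFrame column is named Labels, but you pass df['labels'], which raises a KeyError. Note also that import scipy.stats.pearsonr as rho cannot work, because pearsonr is a function rather than a module; use from scipy import stats instead. The loop can be replaced with a single call to str.contains: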
df['London'] = df['Usernames'].str.contains('London').astype(int)
from scipy import stats
stats.pearsonr(df['Labels'], df['London'])
Out[12]: (0.4, 0.37393392381774704)
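For reference, here is a minimal self-contained sketch that rebuilds the seven sample rows shown in the question (the rows hidden behind "..." are not included) and runs the same steps. Note that str.contains is case-sensitive by default, so "_24londonqr" is not flagged; the second variant passes case=False on the assumption that lowercase "london" should count as well. The column name London_ci is made up for illustration.

```python
import pandas as pd
from scipy import stats

# Sample rows copied from the question; the rows elided by "..." are unknown.
df = pd.DataFrame({
    "Labels": [1, 1, 1, 0, 1, 0, 1],
    "Usernames": [
        "Londonderry", "Londoncalling", "Steveonder43", "Maryclare_re",
        "Patent107391", "Anonymous", "_24londonqr",
    ],
})

# Case-sensitive match, as in the answer: only "London" with a capital L counts.
df["London"] = df["Usernames"].str.contains("London").astype(int)
r_sensitive, p_sensitive = stats.pearsonr(df["Labels"], df["London"])

# Case-insensitive match (an assumption about intent): also flags "_24londonqr".
# "London_ci" is a hypothetical column name used only in this sketch.
df["London_ci"] = df["Usernames"].str.contains("london", case=False).astype(int)
r_insensitive, p_insensitive = stats.pearsonr(df["Labels"], df["London_ci"])

print(r_sensitive, p_sensitive)
print(r_insensitive, p_insensitive)
```

On these seven rows the case-sensitive version reproduces the coefficient of 0.4 shown above. With so few rows the p-value is large, so the result on the full dataset is what matters.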