python pandas scipy pearson-correlation

Estimate correlation in Python


I have a dataset with labels and usernames:

Labels   Usernames
1        Londonderry
1        Londoncalling
1        Steveonder43
0        Maryclare_re
1        Patent107391
0        Anonymous
1        _24londonqr
...

I need to show that there is a correlation between usernames containing the word London and label 1. To do this, I created a second label (a binary column) marking which usernames contain the word London:

for idx, username in df['Usernames']:
    if 'London' in username:
        df['London'].iloc[idx] = 1
    else:
        df['London'].iloc[idx] = 0

Then I compared these two binary variables using the Pearson correlation coefficient:

import scipy.stats.pearsonr as rho
corr = rho(df['labels'], df['London'])

However, it does not work. Am I missing something in the steps above?


Solution

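  • Your dataframe column is named Labels, but you pass labels; column names are case-sensitive. Also, pearsonr is a function rather than a module, so it cannot be imported with import scipy.stats.pearsonr as rho; import stats from scipy instead. I also replaced the loop with str.contains: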

    df['London'] = df['Usernames'].str.contains('London').astype(int)
    from scipy import stats
    stats.pearsonr(df['Labels'], df['London'])
    Out[12]: (0.4, 0.37393392381774704)
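
For anyone who wants to reproduce this end to end, here is a minimal sketch built only from the seven sample rows shown in the question (the real dataset is longer, so the numbers will differ). One thing to watch: str.contains('London') is case-sensitive, so _24londonqr is not flagged. The second column, London_any_case, is a hypothetical name used here for illustration, and it assumes that lowercase "london" should count as a match too.

```python
import pandas as pd
from scipy import stats

# Sample rows copied from the question; the full dataset is elided there.
df = pd.DataFrame({
    "Labels": [1, 1, 1, 0, 1, 0, 1],
    "Usernames": [
        "Londonderry", "Londoncalling", "Steveonder43", "Maryclare_re",
        "Patent107391", "Anonymous", "_24londonqr",
    ],
})

# Case-sensitive match, as in the answer above: "_24londonqr" is NOT flagged.
df["London"] = df["Usernames"].str.contains("London").astype(int)
r, p = stats.pearsonr(df["Labels"], df["London"])

# Assumption: lowercase "london" should count too, so ignore case.
# "London_any_case" is a hypothetical column name, not from the question.
df["London_any_case"] = (
    df["Usernames"].str.contains("london", case=False).astype(int)
)
r_any, p_any = stats.pearsonr(df["Labels"], df["London_any_case"])

print(f"case-sensitive:   r={r:.3f}, p={p:.3f}")
print(f"case-insensitive: r={r_any:.3f}, p={p_any:.3f}")
```

On these seven rows the case-sensitive version gives r = 0.4, matching the output above. Whether to ignore case depends on what "containing the word London" is meant to cover in your data.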