How to pass a list [ ] of specific column values for Pearson correlation calculations in Pandas


I'm taking an Intro to Data Science in Python class on Coursera. This exercise requires the Pearson correlation for only two specific columns. The attached code raises an error.

a) Is it because these columns somehow have "NaN" values in them? b) Is a discrepancy between an 'int' and a 'tuple' causing this error?

The error occurs on the autograder for the class assignment. Please provide insights on how to correct this.

I've added a df.where condition and printed the value counts, where I can see for sure that the "NaN" values are excluded.

import pandas as pd  # just need to provide corr  
from scipy import stats
import numpy as np
a = pd.read_csv('NISPUF17.csv',usecols = ['HAD_CPOX','P_NUMVRC']).dropna()
a = a.query('HAD_CPOX != 77')
print(len(a)) #15286 
#print(a.value_counts().sum()) #15286
# print(a.value_counts())
a.sort_index(inplace=True)
def lengths():
    corr = stats.pearsonr(a[:,0],a[:,1])
    return corr
print(lengths())

Solution

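  • If there are no NaN values in the columns, select each column by name; a[:,0] is NumPy-style indexing and is not valid on a DataFrame: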

    import pandas as pd  # just need to provide corr  
    from scipy import stats
    import numpy as np
    a = pd.read_csv('NISPUF17.csv',usecols = ['HAD_CPOX','P_NUMVRC']).dropna()
    a = a.query('HAD_CPOX != 77')
    print(len(a)) #15286 
    #print(a.value_counts().sum()) #15286
    # print(a.value_counts())
    a.sort_index(inplace=True)
    # def lengths():
    #     corr = stats.pearsonr(a[:,0],a[:,1])
    #     return corr
    # print(lengths())
    
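    # Toy data for demonstration (this overwrites the CSV data loaded above)
    a = pd.DataFrame({'a': [1, 2, 3], 'b': [0, 0, 1]})
    
    def lengths():
        # Select each column by label; pearsonr returns the correlation and the p-value
        corr = stats.pearsonr(a['a'], a['b'])
        return corr
    
    print(lengths())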
    

    If this does not work, check:

    https://stackoverflow.com/a/48591908/1614089
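
The same idea can be sketched with the column names from the question. This is a minimal sketch, not the assignment's reference solution: the DataFrame below is a made-up stand-in for NISPUF17.csv, and the name `clean` is hypothetical. It shows both selection by label and selection by position with `.iloc`, which is the DataFrame counterpart of the NumPy-style `a[:,0]`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for NISPUF17.csv; the values are invented for illustration.
df = pd.DataFrame({
    'HAD_CPOX': [1, 2, 2, 1, 77, 2, np.nan, 1],
    'P_NUMVRC': [0, 1, 2, 0, 1, np.nan, 1, 1],
})

# Drop rows with NaN in either column, then drop the 77 code as the question does.
clean = df.dropna()
clean = clean[clean['HAD_CPOX'] != 77]

# Select the two columns by label...
r_label, p_label = stats.pearsonr(clean['HAD_CPOX'], clean['P_NUMVRC'])

# ...or by position with .iloc instead of a[:,0] and a[:,1].
r_pos, p_pos = stats.pearsonr(clean.iloc[:, 0], clean.iloc[:, 1])

print(len(clean), r_label, p_label)
```

Both calls receive the same two Series, so they should give the same coefficient. pandas also offers `clean['HAD_CPOX'].corr(clean['P_NUMVRC'])`, which computes Pearson correlation by default if the p-value is not needed.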