Tags: pandas, numpy, scipy, feature-selection, pearson-correlation

How to get rid of 'ValueError: all the input array dimensions for the concatenation axis must match exactly' during Pearson Correlation calculation?


I'm trying to calculate the Pearson correlation using the gist provided here, but I get the error ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 52 and the array at index 1 has size 1 (the data frame has 52 records).

Here is the provided function:

def cor_selector(X, y, num_feats):
    cor_list = []
    feature_name = X.columns.tolist()
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1] # error happens during the 2nd call to here
        cor_list.append(cor)
    # replace NaN with 0
    cor_list = [0 if np.isnan(i) else i for i in cor_list]
    # feature name
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-num_feats:]].columns.tolist()
    # feature selection? 0 for not select, 1 for select
    cor_support = [True if i in cor_feature else False for i in feature_name]
    return cor_support, cor_feature

Here is my script:

df = pd.read_csv(DATA_CSV) # shape: (52, 5)
X = df[['a', 'b', 'c']]
y = df[['d']]
num_feats = 3
cor_support, cor_feature = cor_selector(X, y, num_feats)
print(str(len(cor_feature)), 'selected features')

Full stack trace:

Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevd.py", line 1438, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/talha/PycharmProjects/covid19/store_data.py", line 275, in <module>
    cor_support, cor_feature = cor_selector(X, y, num_feats)
  File "/Users/talha/PycharmProjects/covid19/store_data.py", line 254, in cor_selector
    cor = np.corrcoef(X[i], y)[0, 1]
  File "<__array_function__ internals>", line 6, in corrcoef
  File "/Users/talha/.local/share/virtualenvs/covid19-g87yyZJK/lib/python3.7/site-packages/numpy/lib/function_base.py", line 2526, in corrcoef
    c = cov(x, y, rowvar)
  File "<__array_function__ internals>", line 6, in cov
  File "/Users/talha/.local/share/virtualenvs/covid19-g87yyZJK/lib/python3.7/site-packages/numpy/lib/function_base.py", line 2390, in cov
    X = np.concatenate((X, y), axis=0)
  File "<__array_function__ internals>", line 6, in concatenate

Solution

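  • You're passing a Series (X[i]) as the first argument and a single-column DataFrame (y) as the second argument to np.corrcoef. df[['d']] returns a DataFrame of shape (52, 1), whereas df['d'] returns a Series of shape (52,). With the default rowvar=True, np.corrcoef promotes X[i] to shape (1, 52) and then tries to stack y, of shape (52, 1), beneath it, so the sizes along dimension 1 (52 vs. 1) don't match. In your script, change y = df[['d']] to y = df['d'] and it should work.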
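
To make the difference concrete, here is a minimal sketch that reproduces the shape mismatch and then applies the fix. The data is randomly generated as a stand-in for the CSV in the question (the real file isn't available), so the column values and the names y_frame, y_series, and corrs are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV from the question: 52 rows, 5 columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(52, 5)), columns=list("abcde"))

X = df[["a", "b", "c"]]
y_frame = df[["d"]]   # double brackets -> DataFrame, shape (52, 1)
y_series = df["d"]    # single brackets -> Series, shape (52,)

# Passing the single-column DataFrame triggers the concatenation error.
try:
    np.corrcoef(X["a"], y_frame)
    failed = False
except ValueError as exc:
    failed = True
    print("ValueError:", exc)

# Passing the Series works: both inputs are 1-D with 52 elements.
corrs = {col: np.corrcoef(X[col], y_series)[0, 1] for col in X.columns}
print(corrs)

# If y must stay a DataFrame elsewhere, select its only column at the call site.
same = np.corrcoef(X["a"], y_frame.iloc[:, 0])[0, 1]
```

Printing y.shape before calling cor_selector is a quick way to check which of the two cases applies.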