python-3.x, scikit-learn, feature-extraction, sklearn-pandas, feature-engineering

Problem with negative numbers in sklearn.feature_selection.SelectKBest feature scoring module


I was trying automatic feature engineering and selection, so for that I used the Boston house price dataset available in sklearn.

from sklearn.datasets import load_boston
import pandas as pd
data = load_boston()
x = data.data
y = data.target
y = pd.DataFrame(y)

Then I applied the autofeat feature transformation library to the dataset.

import autofeat as af
clf = af.AutoFeatRegressor()
df = clf.fit_transform(x, y)
df = pd.DataFrame(df)

After this, I used SelectKBest to score each feature in relation to the label.

from sklearn.feature_selection import SelectKBest, chi2
X_new = SelectKBest(chi2, k=20)
X_new_done = X_new.fit_transform(df,y)
dfscores = pd.DataFrame(X_new.scores_)
dfcolumns = pd.DataFrame(X_new_done.columns)
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']
print(featureScores.nlargest(10,'Score'))

This gave the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-b0fa1556bdef> in <module>()
      1 from sklearn.feature_selection import SelectKBest, chi2
      2 X_new = SelectKBest(chi2, k=20)
----> 3 X_new_done = X_new.fit_transform(df,y)
      4 dfscores = pd.DataFrame(X_new.scores_)
      5 dfcolumns = pd.DataFrame(X_new_done.columns)

ValueError: Input X must be non-negative.

I have a few negative numbers in my dataset, so how can I overcome this problem?

Note: df has no transformations of y; it only contains transformations of x.


Solution

  • You have a feature with all negative values:

    df['exp(x005)*log(x000)']
    

    returns

    0     -3630.638503
    1     -2212.931477
    2     -4751.790753
    3     -3754.508972
    4     -3395.387438
              ...
    501   -2022.382877
    502   -1407.856591
    503   -2998.638158
    504   -1973.273347
    505   -1267.482741
    Name: exp(x005)*log(x000), Length: 506, dtype: float64
    

    Quoting another answer (https://stackoverflow.com/a/46608239/5025009):

    The error message Input X must be non-negative says it all: Pearson's chi-squared test (goodness of fit) does not apply to negative values. This is logical because the chi-squared test assumes a frequency distribution, and a frequency can't be a negative number. Consequently, sklearn.feature_selection.chi2 asserts that the input is non-negative.

    In many cases, it may be quite safe to simply shift each feature to make all of its values positive, or even to normalize to the [0, 1] interval, as suggested by EdChum.
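
    For example, here is a minimal sketch of the normalization route using sklearn's MinMaxScaler, with df and y as defined in the question; the column names are carried over so a score table still lines up:

    from sklearn.preprocessing import MinMaxScaler
    from sklearn.feature_selection import SelectKBest, chi2
    import pandas as pd

    # Rescale every feature to [0, 1]; chi2 then accepts the input
    scaler = MinMaxScaler()
    df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

    # y.values.ravel() flattens the single-column DataFrame into the 1-D array sklearn expects
    X_new = SelectKBest(chi2, k=20)
    X_new_done = X_new.fit_transform(df_scaled, y.values.ravel())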

    If a data transformation is for some reason not possible (e.g. a negative value is an important factor), you should pick another statistic to score your features, for instance f_regression or mutual_info_regression from sklearn.feature_selection; see the sketch below.

    Since the whole point of this procedure is to prepare the features for another method, it's not a big deal which one you pick; the end result is usually the same or very close.
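
    As a concrete illustration, swapping the score function is a one-line change to the question's code; f_regression handles signed, continuous features directly. Note that the column names must be read from df itself, since fit_transform returns a plain NumPy array with no .columns attribute:

    from sklearn.feature_selection import SelectKBest, f_regression
    import pandas as pd

    # f_regression tolerates negative feature values, unlike chi2
    X_new = SelectKBest(f_regression, k=20)
    X_new_done = X_new.fit_transform(df, y.values.ravel())

    # Pair every feature name with its score and show the top 10
    featureScores = pd.DataFrame({'Specs': df.columns, 'Score': X_new.scores_})
    print(featureScores.nlargest(10, 'Score'))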