
How can I correctly use Pipeline with MinMaxScaler + NMF to predict data?


This is a very small sklearn snippet:

from sklearn import decomposition, linear_model
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated

logistic = linear_model.LogisticRegression()

pipe = Pipeline(steps=[
    ('scaler_2', MinMaxScaler()),
    ('nmf', decomposition.NMF(n_components=6)),
    ('logistic', logistic),
])

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)

pipe.fit(Xtrain, ytrain)
ypred = pipe.predict(Xtest)

I will get this error:

    raise ValueError("Negative values in data passed to %s" % whom)
ValueError: Negative values in data passed to NMF (input X)

According to this question: Scaling test data to 0 and 1 using MinMaxScaler

I know this happens because

the lowest value in my test data is lower than the lowest value in the training data on which the MinMaxScaler was fitted
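The effect can be reproduced in isolation with a tiny sketch (toy values, not the asker's data): a MinMaxScaler fitted on training data maps any test value below the training minimum to a negative number.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on training data whose minimum is 1.0 and maximum is 3.0
Xtrain = np.array([[1.0], [2.0], [3.0]])
scaler = MinMaxScaler().fit(Xtrain)

# A test value below the training minimum is scaled to a negative number:
# (0.5 - 1.0) / (3.0 - 1.0) = -0.25
Xtest = np.array([[0.5]])
print(scaler.transform(Xtest))
```

Feeding such a negative value into NMF then raises exactly the `ValueError` above.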

But I am wondering, is this a bug? It seems that MinMaxScaler (and all scalers) should be applied before I do the prediction; it should not depend on previously fitted training data. Am I right?

Or how could I correctly use preprocessing scalers with Pipeline?

Thanks.


Solution

  • This is not a bug. The main reason you add the scaler to the pipeline is to prevent leaking information from your test set into your model. When you fit the pipeline to your training data, the MinMaxScaler stores the min and max of your training data and uses these values to scale any other data it sees at prediction time. As you also highlighted, this min and max are not necessarily the min and max of your test set! Therefore you may end up with negative values when the min of your test set is smaller than the min of the training set. You need a scaler that does not produce negative values. For instance, you may use sklearn.preprocessing.StandardScaler. Make sure you set the parameter with_mean=False. This way it will not center the data before scaling, but will scale your data to unit variance. Note that this preserves non-negativity only because dividing by a positive standard deviation does not change the sign, so your original features must themselves be non-negative.
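Putting this together, a minimal runnable sketch of the suggested fix (toy random data standing in for the asker's X and y; the init and max_iter settings for NMF are illustrative choices, not requirements):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy non-negative data; replace with your own X and y
rng = np.random.RandomState(0)
X = rng.rand(100, 10)
y = rng.randint(0, 2, 100)

pipe = Pipeline(steps=[
    # with_mean=False: scale to unit variance without centering,
    # so non-negative input stays non-negative for NMF
    ('scaler', StandardScaler(with_mean=False)),
    ('nmf', NMF(n_components=6, init='nndsvda', max_iter=500)),
    ('logistic', LogisticRegression()),
])

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)

pipe.fit(Xtrain, ytrain)   # no "Negative values" error now
ypred = pipe.predict(Xtest)
print(ypred.shape)
```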