python machine-learning scikit-learn artificial-intelligence logistic-regression

Pipeline giving different answer in sklearn python

I have written two programs that are supposed to follow the same logic, but they give different answers.

First-

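import numpy as np
from sklearn import preprocessing as pps
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train_data = train_features[:1710][:]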
train_label = label_features[:1710][:].ravel()
test_data = train_features[1710:][:]
test_label = label_features[1710:][:].ravel()

def getAccuracy(ans):
    d = 0
    for i in range(np.size(ans,0)):
        if(ans[i] == test_label[i]):
            d+=1
    return (d*100)/float(np.size(ans,0))

estimators = [('pps', pps.RobustScaler()), ('clf', LogisticRegression())]
pipe = Pipeline(estimators)
pipe = pipe.fit(train_data,train_label)

ans = pipe.predict(test_data)
getAccuracy(ans)

Second-

train_data = train_features[:1710][:]
train_label = label_features[:1710][:].ravel()
test_data = train_features[1710:][:]
test_label = label_features[1710:][:].ravel()

def getAccuracy(ans):
    d = 0
    for i in range(np.size(ans,0)):
        if(ans[i] == test_label[i]):
            d+=1
    return (d*100)/float(np.size(ans,0))

def preprocess(features):
    return pps.RobustScaler().fit_transform(features)

train_data = preprocess(train_data)
clf = LogisticRegression().fit(train_data,train_label)

test_data = preprocess(test_data)
ans = clf.predict(test_data)
getAccuracy(ans)

The first one gives 80.81 and the second one gives 84.92. Why do they differ?

Solution

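Your second program is invalid because "preprocess" calls fit_transform on the test set, which fits a new RobustScaler to the test data. That should not happen: the test set ends up scaled by its own median and interquartile range rather than by the statistics learned from the training data, so the classifier sees inputs on a different scale from the one it was trained on. Pipeline, on the other hand, fits RobustScaler only to your training data and then calls only "transform" on the test data.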
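
To make the difference concrete, here is a minimal sketch of a manual version that should match the Pipeline: fit the scaler once on the training data and reuse it for the test data. The data is not shown in the question, so the synthetic dataset from make_classification below is only an assumed stand-in for train_features and label_features; the 1710 split point is taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for train_features / label_features (an assumption;
# the real data from the question is not available).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
train_data, test_data = X[:1710], X[1710:]
train_label, test_label = y[:1710], y[1710:]

# Pipeline version: the scaler is fitted on the training data only,
# and predict() applies transform() to the test data.
pipe = Pipeline([('pps', RobustScaler()), ('clf', LogisticRegression())])
pipe.fit(train_data, train_label)
pipe_pred = pipe.predict(test_data)

# Manual version: fit the scaler once on the training data, then reuse
# the same fitted scaler to transform both training and test data.
scaler = RobustScaler().fit(train_data)
clf = LogisticRegression().fit(scaler.transform(train_data), train_label)
manual_pred = clf.predict(scaler.transform(test_data))

# Both approaches should now produce the same predictions.
same = np.array_equal(pipe_pred, manual_pred)
accuracy = 100.0 * np.mean(manual_pred == test_label)
print(same, accuracy)
```

With the scaler fitted only once, the manual code and the Pipeline go through the same steps, so their predictions should agree. The accuracy printed here applies to the synthetic data only and says nothing about the 80.81 or 84.92 figures in the question.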