Tags: python, scikit-learn, preprocessor, knn

Accuracy of preprocessing single sample


I've been working on predicting samples with the sklearn implementation of KNN.

So far I've been training my classifier on one subset of my dataset and testing it on another, disjoint subset, and I'm seeing an accuracy of around 98%.

However, when attempting to predict a single sample, the predictions are all over the place, even for samples the model was trained on. My only guess is that there is a mismatch between preprocessing the entire dataset with preprocessing.scale and preprocessing a single sample with the same function.

I've read "Preprocessing in scikit learn - single sample - Depreciation warning" and am wondering whether there is a correct way to preprocess a single sample.

EDIT: The preprocessing code is shown below. For the whole dataset:

self.trainData = preprocessing.scale(self.trainData)

For a single sample, where log has the same form as the samples in trainData:

log = preprocessing.scale(log)
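
A minimal sketch of how differently these two calls behave (toy numbers, not my real data):

import numpy as np
from sklearn import preprocessing

# Toy data: 3 samples, 2 features on very different scales
trainData = np.array([[1.0, 200.0],
                      [2.0, 300.0],
                      [3.0, 400.0]])

# Scaling the full dataset standardizes each column using the
# mean/std computed over all samples
print(preprocessing.scale(trainData))
# [[-1.2247 -1.2247]
#  [ 0.      0.    ]
#  [ 1.2247  1.2247]]

# Scaling one sample on its own can only use that sample's values,
# so the result has nothing to do with the training statistics
log = trainData[0]
print(preprocessing.scale(log))
# [-1.  1.]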


Solution

  • You should use StandardScaler, which is a class-based counterpart of the scale function (see the scikit-learn documentation). It stores the mean and standard deviation learned from the training data and then uses those statistics to scale other data.

    Example usage:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()

    # Learn the mean/std from the training data and scale it in one step
    trainData = scaler.fit_transform(trainData)

    # transform expects a 2D array, so a single sample has to be
    # reshaped into one row of shape (1, n_features)
    log = scaler.transform(np.reshape(log, (1, -1)))
    

    fit_transform() is just a shortcut for first calling fit() and then transform().

    The fit() method does not transform anything; it just analyses the data to learn the mean and standard deviation (and returns the fitted scaler itself). transform() then uses the learned mean and std to scale the data and returns the new array.
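
    For illustration, the two-step and one-step forms below give the same result:

    scaler = StandardScaler()
    scaler.fit(trainData)                 # learn mean and std
    scaled = scaler.transform(trainData)  # apply them

    scaled = StandardScaler().fit_transform(trainData)  # equivalent one-liner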

    You should only call fit() or fit_transform() on the training data, never on anything else. To transform test data or new data, always use transform().
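
    Putting it together, here is a minimal end-to-end sketch (the iris data and the KNeighborsClassifier settings are illustrative stand-ins, not the asker's actual setup):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    trainData, testData, trainLabels, testLabels = train_test_split(
        X, y, test_size=0.3, random_state=0)

    scaler = StandardScaler()
    trainData = scaler.fit_transform(trainData)  # fit on training data only
    testData = scaler.transform(testData)        # reuse the training mean/std

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(trainData, trainLabels)
    print(knn.score(testData, testLabels))

    # Predicting a single sample: scale it with the *same* fitted scaler
    log = X[0]
    log = scaler.transform(log.reshape(1, -1))
    print(knn.predict(log))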