machine-learning, scikit-learn, naive-bayes, gridsearchcv

Why does Naive Bayes give results on training and test data, but a negative-values error when applied with GridSearchCV?


I have studied some related questions regarding Naive Bayes; here are the links: link1, link2, link3. I am using TF-IDF for feature selection and Naive Bayes for classification. After fitting the model, it produced predictions successfully, and here is the output:

accuracy = train_model(model, xtrain, train_y, xtest)
print("NB, CharLevel Vectors: ", accuracy)

NB, accuracy: 0.5152523571824736

I don't understand why Naive Bayes gives no error during training and testing, but fails when I run it with GridSearchCV:

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PowerTransformer

params_NB = {'alpha': [1.0], 'class_prior': [None], 'fit_prior': [True]}

gs_NB = GridSearchCV(estimator=model,
                     param_grid=params_NB,
                     cv=cv_method,
                     verbose=1,
                     scoring='accuracy')

Data_transformed = PowerTransformer().fit_transform(xtest.toarray())
gs_NB.fit(Data_transformed, test_y)

It gave this error

Negative values in data passed to MultinomialNB (input X)

Solution

  • TL;DR: PowerTransformer, which you apply only in the GridSearchCV case, produces negative data, which expectedly makes MultinomialNB fail, as explained in detail below; if your initial xtrain really contains TF-IDF features and you do not transform it similarly with PowerTransformer (you don't show anything like that), the fact that it works OK is also unsurprising and expected.
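
    To see how this happens, here is a minimal sketch (the toy documents are made up purely for illustration, not taken from your data): TF-IDF values are always non-negative, but PowerTransformer, with its default standardize=True, centers each feature, so negative values necessarily appear in its output.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import PowerTransformer

    docs = ["the cat sat", "the dog barked", "cats and dogs bark"]  # toy corpus
    X_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
    print((X_tfidf < 0).any())   # False: TF-IDF features are non-negative

    X_pt = PowerTransformer().fit_transform(X_tfidf)
    print((X_pt < 0).any())      # True: standardization introduces negative values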


    Although not terribly clear from the documentation:

    The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

    Reading closely, you realize that it implies that all the features should be non-negative.
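
    As a quick check of the "fractional counts such as tf-idf may also work" part, a tiny sketch (the numbers are arbitrary): MultinomialNB happily accepts non-integer features, as long as none of them is negative.

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    X_frac = np.array([[0.3, 0.0, 1.2],
                       [0.0, 0.7, 0.4]])   # fractional, but all >= 0
    y = np.array([0, 1])
    MultinomialNB().fit(X_frac, y)          # fits without error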

    This has a statistical basis indeed; from the Cross Validated thread Naive Bayes questions: continus data, negative data, and MultinomialNB in scikit-learn:

    MultinomialNB assumes that features have multinomial distribution which is a generalization of the binomial distribution. Neither binomial nor multinomial distributions can contain negative values.

    See also the (open) Github issue MultinomialNB fails when features have negative values (it is for a different library, not scikit-learn, but the underlying mathematical rationale is the same).
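
    For reference, the multinomial probability mass function on which MultinomialNB is based makes the constraint explicit: the feature values play the role of counts x_i, which only make sense as non-negative numbers:

    P(x_1, \dots, x_k \mid n, p_1, \dots, p_k)
      = \frac{n!}{x_1! \cdots x_k!} \prod_{i=1}^{k} p_i^{x_i},
    \qquad x_i \in \{0, 1, 2, \dots\}, \quad \sum_{i=1}^{k} x_i = n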

    It is not actually difficult to demonstrate this; using the example available in the documentation:

    import numpy as np
    rng = np.random.RandomState(1)
    X = rng.randint(5, size=(6, 100))  # random integer data
    y = np.array([1, 2, 3, 4, 5, 6])
    from sklearn.naive_bayes import MultinomialNB
    clf = MultinomialNB()
    clf.fit(X, y) # works OK
    
    # inspect X - only 0's and positive integers
    print(X)
    

    Now, changing a single element of X to a negative number and trying to fit again:

    X[1][0] = -1
    clf.fit(X, y)
    

    gives indeed:

    ValueError: Negative values in data passed to MultinomialNB (input X)
    

    What can you do? As the Github thread linked above suggests:

    • Either use MinMaxScaler(), which will bring all the features to [0, 1]
    • Or use GaussianNB instead, which does not suffer from this limitation
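
    A minimal sketch of both options, assuming the Data_transformed array and test_y labels from your snippet (the parameter grids here are just placeholders):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.naive_bayes import MultinomialNB, GaussianNB
    from sklearn.model_selection import GridSearchCV

    # Option 1: rescale the (possibly negative) features to [0, 1] inside a pipeline,
    # so the scaler is re-fit on each CV training fold
    pipe = make_pipeline(MinMaxScaler(), MultinomialNB())
    gs_mnb = GridSearchCV(pipe,
                          param_grid={'multinomialnb__alpha': [0.1, 0.5, 1.0]},
                          cv=5,
                          scoring='accuracy')
    # gs_mnb.fit(Data_transformed, test_y)

    # Option 2: switch to GaussianNB, which accepts negative (continuous) features
    gs_gnb = GridSearchCV(GaussianNB(),
                          param_grid={'var_smoothing': [1e-9, 1e-8, 1e-7]},
                          cv=5,
                          scoring='accuracy')
    # gs_gnb.fit(Data_transformed, test_y)

    Putting MinMaxScaler inside the pipeline (rather than transforming the whole array up front) ensures the scaling parameters are learned only from the training folds during cross-validation.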