Text classification + Naive Bayes + Python : Input contains NaN, infinity or a value too large for dtype('float64')

I am trying to do text classification with Naive Bayes. This is my code:

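from sklearn import cross_validation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

#splitting Pandas dataframe into train set and test set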

x_train, x_test, y_train, y_test = cross_validation.train_test_split(data['description'], data['category_id'], test_size=0.2, random_state=42)

#production of bag of words from x_train

count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(x_train)
train_vocab = count_vect.get_feature_names()

#training the Naive Bayes classifier

clf = MultinomialNB().fit(x_train_counts, y_train)


ValueError                                Traceback (most recent call last)
<ipython-input-46-0cb3dc7193bf> in <module>()
      1 #training the Naive Bayes classifier
----> 3 clf = MultinomialNB().fit(x_train_counts, y_train)

~/anaconda3/envs/tensorflow/lib/python3.5/site-packages/sklearn/ in fit(self, X, y, sample_weight)
    577             Returns self.
    578         """
--> 579         X, y = check_X_y(X, y, 'csr')
    580         _, n_features = X.shape

~/anaconda3/envs/tensorflow/lib/python3.5/site-packages/sklearn/utils/ in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    577     else:
    578         y = column_or_1d(y, warn=True)
--> 579         _assert_all_finite(y)
    580     if y_numeric and y.dtype.kind == 'O':
    581         y = y.astype(np.float64)

~/anaconda3/envs/tensorflow/lib/python3.5/site-packages/sklearn/utils/ in _assert_all_finite(X)
     42             and not np.isfinite(X).all()):
     43         raise ValueError("Input contains NaN, infinity"
---> 44                          " or a value too large for %r." % X.dtype)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The type of x_train_counts is scipy.sparse.csr.csr_matrix.

<class 'scipy.sparse.csr.csr_matrix'>

The type of y_train is pandas.core.series.Series.

<class 'pandas.core.series.Series'>


  • I suspect the issue is related to your data['description'] and data['category_id']. Is the first an array-like of n texts, and the second another array-like of n labels for those texts, e.g., ['0', '1', '3', ...]?

    As a test, replacing your data with an sklearn dataset produces a correct run:

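    from sklearn import cross_validation  # scikit-learn >= 0.20: use sklearn.model_selection instead
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    categories = ['alt.atheism', 'soc.religion.christian']
    dataset = fetch_20newsgroups(subset='train',
         categories=categories, shuffle=True, random_state=42)
    #dataset.data holds the texts, dataset.target the integer labels
    x_train, x_test, y_train, y_test = cross_validation.train_test_split(dataset.data, dataset.target, test_size=0.2, random_state=42)
    #production of bag of words from x_train
    count_vect = CountVectorizer()
    x_train_counts = count_vect.fit_transform(x_train)
    train_vocab = count_vect.get_feature_names()
    #training the Naive Bayes classifier
    clf = MultinomialNB().fit(x_train_counts, y_train)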

    Try that out and let me know if it helps.
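
The traceback in the question fails inside _assert_all_finite(y), which suggests the labels rather than the text counts contain missing values. Below is a minimal sketch of how one might check for that and drop the affected rows before training. The DataFrame here is a made-up stand-in for your data (only the column names description and category_id come from the question), and dropping rows is just one possible way to handle missing labels.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for the asker's DataFrame; one label is missing.
data = pd.DataFrame({
    'description': ['cheap red shoes', 'fast gaming laptop',
                    'leather boots', 'laptop with ssd', 'unknown item'],
    'category_id': [1.0, 2.0, 1.0, 2.0, np.nan],
})

# Count missing values in each column used for training.
n_missing_labels = int(data['category_id'].isnull().sum())
n_missing_text = int(data['description'].isnull().sum())
print(n_missing_labels, n_missing_text)

# Drop rows where either the text or the label is missing.
clean = data.dropna(subset=['description', 'category_id'])

# Same pipeline as in the question, now on the cleaned rows.
count_vect = CountVectorizer()
x_counts = count_vect.fit_transform(clean['description'])
clf = MultinomialNB().fit(x_counts, clean['category_id'])

# Classify a new, made-up description.
predicted = clf.predict(count_vect.transform(['red boots']))
print(predicted)
```

If the count of missing labels is non-zero on your real data, doing the dropna before train_test_split should keep the NaN values out of y_train.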