
ValueError: np.nan is an invalid document, expected byte or unicode string

I am trying to perform sentiment analysis on Uber reviews. I used scikit-learn's Naive Bayes classifier, with training data from a Kaggle reviews dataset. The test data is in an xlsx sheet, which I loaded into a pandas DataFrame:

import pandas as pd

Since the column had dtype object, I converted it to a list like this:

test_text = []
for comments in comments_t:

My code for classifying text based on training data:

# Training Phase
from sklearn.naive_bayes import BernoulliNB
classifier = BernoulliNB().fit(train_documents,labels)

def sentiment(word):
    return classifier.predict(count_vectorizer.transform([word]))

But while predicting, it raises this ValueError:

/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/ in transform(self, raw_documents)
   1085         # use the same matrix-building strategy as fit_transform
-> 1086         _, X = self._count_vocab(raw_documents, fixed_vocab=True)
   1087         if self.binary:

/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/ in _count_vocab(self, raw_documents, fixed_vocab)
    940         for doc in raw_documents:
    941             feature_counter = {}
--> 942             for feature in analyze(doc):
    943                 try:
    944                     feature_idx = vocabulary[feature]

/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/ in <lambda>(doc)
    326                                                tokenize)
    327             return lambda doc: self._word_ngrams(
--> 328                 tokenize(preprocess(self.decode(doc))), stop_words)
    330         else:

/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/ in decode(self, doc)
    142         if doc is np.nan:
--> 143             raise ValueError("np.nan is an invalid document, expected byte or "
    144                              "unicode string.")

ValueError: np.nan is an invalid document, expected byte or unicode string.
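The traceback shows the error is raised in `decode` when a document is `np.nan`, which suggests that at least one cell in the test column is empty. A minimal sketch that reproduces the same error, assuming a toy two-document corpus (hypothetical, only used to fit a vocabulary):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only used to fit a vocabulary.
vect = CountVectorizer().fit(["good ride", "bad ride"])

try:
    # A missing spreadsheet cell reaches the vectorizer as np.nan.
    vect.transform([np.nan])
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

If this matches what you see, the classifier itself is not the problem; the input list contains a missing value instead of a string.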

I tried to solve according to this:


  • The data that I found on Kaggle for Uber is

    Now, coming to your problem:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import BernoulliNB
    df = pd.read_csv('Uber_Ride_Reviews.csv')
                                         ride_review    ...      sentiment
    0  I completed running New York Marathon requeste...    ...              0
    1  My appointment time auto repairs required earl...    ...              0
    2  Whether I using Uber ride service Uber Eats or...    ...              0
    3  Why hard understand I trying retrieve Uber cab...    ...              0
    4  I South Beach FL I staying major hotel ordered...    ...              0
    Out[8]: Index(['ride_review', 'ride_rating', 'sentiment'], dtype='object')
    vect  = CountVectorizer()
    vect1 = vect.fit_transform(df['ride_review'])
    classifier = BernoulliNB().fit(vect1, df['sentiment'])
    # predicting on a new comment gives output
    new_test_ = vect.transform(['uber ride is very good'])
    classifier.predict(new_test_)
    Out[5]: array([0], dtype=int64)
    # But your sentiment function only takes the text; you also need to
    # pass the classifier and the CountVectorizer instance
    def sentiment(word, classifier, vect):
        return classifier.predict(vect.transform([word]))
    # calling the function above on a new comment
    sentiment('uber ride is very good', classifier, vect)
    O/p --> Out[10]: array([0], dtype=int64)
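The snippet above does not deal with empty cells in the test sheet, which is what the `np.nan` error points to. A minimal sketch of one way to handle them, assuming a test column named `comments` and a tiny inline training set (both hypothetical stand-ins for the xlsx sheet and the Kaggle data):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy training data standing in for the Kaggle reviews.
train = pd.DataFrame({
    "ride_review": ["great ride friendly driver", "terrible ride rude driver",
                    "good service", "bad service late driver"],
    "sentiment": [1, 0, 1, 0],
})

# Hypothetical test column with a blank cell, as read from a spreadsheet.
test = pd.DataFrame({"comments": ["great driver", np.nan, "rude and late"]})

vect = CountVectorizer()
classifier = BernoulliNB().fit(vect.fit_transform(train["ride_review"]),
                               train["sentiment"])

# Drop missing cells and force the rest to str before vectorizing.
clean = test["comments"].dropna().astype(str)
predictions = classifier.predict(vect.transform(clean))

print(dict(zip(clean, predictions)))
```

Dropping rows is only one option; if every row needs a prediction, `fillna("")` keeps the row count unchanged, at the cost of classifying empty strings.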