python unicode pandas nltk stemming

UnicodeDecodeError: unexpected end of data while stemming over dataset


I am new to Python and I am trying to work on a small chunk of the Yelp dataset, which was in JSON but which I converted to CSV, using the pandas library and NLTK.

While preprocessing the data, I first try to remove all the punctuation and the most common stop words. After that, I want to apply the Porter stemming algorithm, which is readily available in nltk.stem.

Here is my code:

import re
from nltk.corpus import stopwords

def stopWords(review):
    """A method for removing the noise in the data and the most common stop words (NLTK)."""
    stopset = set(stopwords.words("english"))
    review = review.lower()
    review = review.replace(".", "")
    review = review.replace("-", " ")
    review = review.replace(")", "")
    review = review.replace("(", "")
    review = review.replace("i'm", " ")
    review = review.replace("!", "")
    # Strip the remaining punctuation characters in one pass.
    review = re.sub("[$!@#*;:<+>~-]", '', review)
    row = review.split()

    # Keep only the words that are not stop words.
    tokens = ' '.join([word for word in row if word not in stopset])
    return tokens

I then pass the tokens to a stemming method I wrote:

from nltk import stem

def stemWords(impWords):
    """A method for stemming the words to their roots using the Porter algorithm (NLTK)."""
    stemmer = stem.PorterStemmer()
    tok = stopWords(impWords)
    # ========================================================================
    stemmed = " ".join([stemmer.stem(str(word)) for word in tok.split(" ")])
    # ========================================================================
    return stemmed

But I am getting the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0: unexpected end of data. The line that is inside the '==' is giving me the error.

I have tried cleaning the data and removing all special characters (!@#$^&* and others) to make this work. The stop-word removal works fine, but the stemming does not. Can somebody tell me what I am doing wrong?

If my data is not clean, or the Unicode string is breaking somewhere, is there any way I can clean or fix it so that it won't give me this error? I want to do stemming, so any suggestions would be helpful.


Solution

  • Read up on Unicode string processing in Python. In Python 2 there is the type str, which holds bytes, but there is also the type unicode, which holds text.

    I suggest that you:

    1. decode each line immediately after reading, to narrow down incorrect characters in your input data (real data contains errors)

    2. work with unicode and u" " strings everywhere.
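
The two suggestions can be sketched roughly as follows. This is only an illustration under stated assumptions, not the asker's actual pipeline: the helper names decode_line and clean and the sample raw_lines are hypothetical, the input is assumed to be UTF-8, and replacing undecodable bytes with U+FFFD is assumed to be acceptable. The second sample line ends with a lone 0xc2 byte, which is the first byte of a two-byte UTF-8 sequence, so decoding it fails with the same "unexpected end of data" reason as in the question.

```python
import re


def decode_line(raw, lineno, encoding="utf-8"):
    """Decode one raw byte line, reporting where decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the offset of the first undecodable byte in this line.
        print("line %d: undecodable byte at offset %d (%s)"
              % (lineno, err.start, err.reason))
        # Assumption: substituting U+FFFD for bad bytes is acceptable here.
        return raw.decode(encoding, "replace")


def clean(text):
    """Lowercase and strip punctuation, operating on text rather than bytes."""
    text = re.sub(u"[^\\w\\s]", u" ", text.lower(), flags=re.UNICODE)
    return u" ".join(text.split())


# Hypothetical raw input: the second line is cut off in the middle of a character.
raw_lines = [b"Caf\xc3\xa9 was GREAT!", b"cost: \xc2"]

# Step 1: decode immediately after reading, so bad lines are identified early.
texts = [decode_line(raw, n) for n, raw in enumerate(raw_lines, 1)]

# Step 2: from here on, everything is text (unicode in Python 2, str in Python 3).
cleaned = [clean(t) for t in texts]
```

Once every token is text, it can probably be passed to the stemmer as it is, so the str(word) conversion in the question should no longer be needed.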