Tags: python-3.x, pandas, machine-learning, nltk, data-cleaning

Preprocessing string data in pandas dataframe


I have a user review dataset. I have loaded it and now want to preprocess the reviews (remove stopwords and punctuation, convert to lowercase, remove salutations, etc.) before fitting a classifier, but I am getting an error. Here is my code:

    import pandas as pd
    import numpy as np
    df=pd.read_json("C:/Users/ABC/Downloads/Compressed/reviews_Musical_Instruments_5.json/Musical_Instruments_5.json",lines=True)
    dataset=df.filter(['overall','reviewText'],axis=1)
    def cleanText(text):
        """
        removes punctuation, stopwords and returns lowercase text in a list 
        of single words
        """
        text = (text.lower() for text in text)   

        from bs4 import BeautifulSoup
        text = BeautifulSoup(text).get_text()

        from nltk.tokenize import RegexpTokenizer
        tokenizer = RegexpTokenizer(r'\w+')
        text = tokenizer.tokenize(text)

        from nltk.corpus import stopwords
        clean = [word for word in text if word not in 
        stopwords.words('english')]

        return clean

    dataset['reviewText']=dataset['reviewText'].apply(cleanText)
    dataset['reviewText']

I am getting this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-68-f42f70ec46e5> in <module>()
----> 1 dataset['reviewText']=dataset['reviewText'].apply(cleanText)
      2 dataset['reviewText']

~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   2353             else:
   2354                 values = self.asobject
-> 2355                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2356 
   2357         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()

<ipython-input-64-5c6792de405c> in cleanText(text)
     10     from nltk.tokenize import RegexpTokenizer
     11     tokenizer = RegexpTokenizer(r'\w+')
---> 12     text = tokenizer.tokenize(text)
     13 
     14     from nltk.corpus import stopwords

~\Anaconda3\lib\site-packages\nltk\tokenize\regexp.py in tokenize(self, text)
    127         # If our regexp matches tokens, use re.findall:
    128         else:
--> 129             return self._regexp.findall(text)
    130 
    131     def span_tokenize(self, text):

TypeError: expected string or bytes-like object

Please suggest corrections to this function for my data, or suggest a new function for data cleaning.

Here is my data:

    overall reviewText
0   5   Not much to write about here, but it does exac...
1   5   The product does exactly as it should and is q...
2   5   The primary job of this device is to block the...
3   5   Nice windscreen protects my MXL mic and preven...
4   5   This pop filter is great. It looks and perform...
5   5   So good that I bought another one. Love the h...
6   5   I have used monster cables for years, and with...
7   3   I now use this cable to run from the output of...
8   5   Perfect for my Epiphone Sheraton II. Monster ...
9   5   Monster makes the best cables and a lifetime w...
10  5   Monster makes a wide array of cables, includin...
11  4   I got it to have it if I needed it. I have fou...
12  3   If you are not use to using a large sustaining...
13  5   I love it, I used this for my Yamaha ypt-230 a...
14  5   I bought this to use in my home studio to cont...
15  2   I bought this to use with my keyboard. I wasn'...

Solution

  • print(df)

        overall reviewText
    0   5   Not much to write about here, but it does exac...
    1   5   The product does exactly as it should and is q...
    2   5   The primary job of this device is to block the...
    3   5   Nice windscreen protects my MXL mic and preven...
    4   5   This pop filter is great. It looks and perform...
    5   5   So good that I bought another one. Love the h...
    6   5   I have used monster cables for years, and with...
    7   3   I now use this cable to run from the output of...
    8   5   Perfect for my Epiphone Sheraton II. Monster ...
    9   5   Monster makes the best cables and a lifetime w...
    10  5   Monster makes a wide array of cables, includin...
    11  4   I got it to have it if I needed it. I have fou...
    12  3   If you are not use to using a large sustaining...
    13  5   I love it, I used this for my Yamaha ypt-230 a...
    14  5   I bought this to use in my home studio to cont...
    15  2   I bought this to use with my keyboard. I wasn'...
    

    To convert the text to lowercase:

    df["reviewText"] = df["reviewText"].str.lower()
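
If the lowercase step is done through `apply`, a missing review (None or NaN) would raise, because `lower` only works on strings. The sample data shows no missing rows, so this is only a precaution. A minimal sketch of a guard follows; `to_lower` and the `reviews` list are hypothetical names made up for illustration, not part of the original code.

```python
def to_lower(value):
    """Lowercase strings; return any other value (None, NaN, ...) unchanged."""
    return value.lower() if isinstance(value, str) else value

# Hypothetical column values: one real review plus two kinds of missing entry.
reviews = ["This pop filter is great.", None, float("nan")]
lowered = [to_lower(r) for r in reviews]
print(lowered[0])
```

With a dataframe, the same helper could be passed directly, e.g. `df["reviewText"].apply(to_lower)`.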
    

    To remove punctuation and numbers:

    import re
    # \w alone would keep digits and underscores; [^\W\d_] matches letters only
    df["reviewText"] = df["reviewText"].apply(lambda x: " ".join(re.findall(r'[^\W\d_]+', x)))
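
Keep in mind that `\w` matches letters, digits and the underscore, so a `\w+` pattern strips punctuation but keeps numbers. The sketch below compares it with a letters-only pattern on one lowercased review fragment from the data above; the variable names are just for illustration.

```python
import re

sample = "i love it, i used this for my yamaha ypt-230"

# \w+ keeps digits, so "230" survives as its own token.
with_digits = " ".join(re.findall(r"\w+", sample))

# [^\W\d_]+ means "word characters that are not digits or underscore", i.e. letters.
letters_only = " ".join(re.findall(r"[^\W\d_]+", sample))

print(with_digits)
print(letters_only)
```

Which one you want depends on whether model numbers such as "230" carry any signal for your classifier.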
    

    To remove stopwords, either install a stopword list (the stop-words package from PyPI is used here) or build your own list, and apply it with a function:

    from stop_words import get_stop_words
    # A set makes the membership test below fast
    stop_words = set(get_stop_words('en'))
    
    def remove_stopWords(s):
        '''Return s with English stop words removed.'''
        return ' '.join(word for word in s.split() if word not in stop_words)
    
    df["reviewText"] = df["reviewText"].apply(remove_stopWords)
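
As for the error in the question: it most likely comes from the line `text = (text.lower() for text in text)`. That builds a generator over the characters of the review instead of a lowercased string, and the tokenizer's `findall` call (visible in the traceback) only accepts a string. Below is a rough sketch of a corrected per-review function that can be given to `apply`. It uses plain `re` with the same `\w+` pattern, and a tiny hand-written stop-word set that is only an assumption for the example; swap in NLTK's or the stop-words package's list in practice. The names `clean_text`, `STOP_WORDS` and `TOKEN_RE` are made up here.

```python
import re

# Tiny illustrative stop-word set (an assumption for this sketch, not a real list).
STOP_WORDS = frozenset({"a", "an", "and", "is", "it", "the", "this", "to"})

TOKEN_RE = re.compile(r"\w+")

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase one review, split it into word tokens, drop stop words."""
    text = str(text).lower()         # lower() on the whole string, not per character
    tokens = TOKEN_RE.findall(text)  # punctuation is dropped here
    return [tok for tok in tokens if tok not in stop_words]

sample = "This pop filter is great."
cleaned = clean_text(sample)
print(cleaned)

# Reproducing the question's problem: a generator expression is not a string.
broken = (ch.lower() for ch in sample)
try:
    TOKEN_RE.findall(broken)
    raised = False
except TypeError:
    raised = True
print(raised)
```

On the dataframe this would be used as `dataset['reviewText'] = dataset['reviewText'].apply(clean_text)`, which gives a list of words per review as the original docstring intended.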