python nltk tokenize text-mining

word_tokenize with same code and same dataset, but different result, why?


Last month, I tried to tokenize text and build a bag of words to see which words appear most frequently. Today, I wanted to do it again on the same dataset with the same code. It still runs, but the result is different, and today's outcome is obviously wrong because the word frequencies have decreased significantly.

Here is my code:

from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import WordNetLemmatizer
import nltk
from collections import Counter

sent = nltk.word_tokenize(str(df.description))
lower_token = [t.lower() for t in sent]
alpha = [t for t in lower_token if t.isalpha()]
stop_word =  [t for t in alpha if t not in ENGLISH_STOP_WORDS]
k = WordNetLemmatizer()
lemma = [k.lemmatize(t) for t in stop_word]
bow = Counter(lemma)
print(bow.most_common(20))

Here is a sample of my dataset

This dataset is from Kaggle; it is called "Wine Reviews".


Solution

  • Welcome to StackOverflow.

    There could be two causes for your problem.

    1) You may have modified the dataset. I would check the data itself for changes, because your code has no random elements and will not produce different results from day to day on the same input.

    2) The second possible cause is your use of df.description. When you convert a dataframe column to a string in this line:

    sent = nltk.word_tokenize(str(df.description))
    

    you get a truncated output. If you look at the type of df.description, you will see it is a Series object, and str() returns its display representation rather than the full text.

    I created another example and it is as follows:

    from nltk.tokenize import word_tokenize
    import pandas as pd
    
    df = pd.DataFrame({'description' : ['The OP is asking a question and I referred him to the Minimal Verifiable Example page which states: When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimal, reproducible example (reprex), a minimal, complete and verifiable example (mcve), or a minimal, workable example (mwe). Regardless of how it\'s communicated to you, it boils down to ensuring your code that reproduces the problem follows the following guidelines:']})
    
    
    print(df.description)
    
    0    The OP is asking a question and I referred him...
    Name: description, dtype: object
    

    As you can see above, the output is truncated; it is not the full text of the description column.

    My recommendation is to revisit this line of your code and find a different way of doing it:

    sent = nltk.word_tokenize(str(df.description))
    

    Note that the method you used will also include the index number (which, I understand, your isalpha filter removed) and the footer Name: description, dtype: object in the data you are processing.
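    To make the pitfall concrete, here is a minimal sketch (plain pandas, no NLTK needed) showing that str() on a Series yields a truncated display string plus the index and footer, not the raw text:

```python
import pandas as pd

# a single long value in a column named 'description'
s = pd.Series(['a' * 100], name='description')
text = str(s)

# long values are truncated in the display representation by default...
print('...' in text)    # True
# ...and the footer "Name: description, dtype: object" is appended
print('dtype' in text)  # True
```

    Any tokenizer applied to this string will therefore see '...', the index, and the footer words as part of the data.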

    One way would be to use map to process your data. An example is:

    pd.set_option('display.max_colwidth', None)  # use None, not -1, in recent pandas: show full column width
    df['tokenized'] = df['description'].map(str).map(nltk.word_tokenize)
    

    Proceed this way for the other operations as well. An easy approach is to build a preprocessing function that applies all the pre-processing steps you want to the dataframe.
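    As a sketch of that idea (hypothetical names; plain str.split stands in for nltk.word_tokenize and a tiny stop-word set stands in for ENGLISH_STOP_WORDS, so the example runs without NLTK data downloads):

```python
import pandas as pd
from collections import Counter

STOP_WORDS = {'the', 'is', 'a', 'of'}  # stand-in for ENGLISH_STOP_WORDS

def preprocess(text):
    """Chain the cleaning steps from the question on one document."""
    tokens = text.split()                       # swap in nltk.word_tokenize in practice
    tokens = [t.lower() for t in tokens]        # lowercase
    tokens = [t for t in tokens if t.isalpha()] # keep alphabetic tokens
    return [t for t in tokens if t not in STOP_WORDS]

df = pd.DataFrame({'description': ['The wine is crisp',
                                   'A bold wine of character']})
df['tokens'] = df['description'].map(preprocess)  # row by row, no truncation

# flatten the per-row token lists into one bag of words
bow = Counter(t for toks in df['tokens'] for t in toks)
print(bow.most_common(3))
```

    Because map processes each row's actual value, the result is stable across runs and never depends on display settings.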

    I hope this helps.