
Drop most frequent words from dataset


I'm trying to work with text that contains a lot of repetition. I have used the tf-idf vectorizer from scikit-learn before, which has a parameter max_df=0.5: any word present in more than 50% of the documents is ignored. I'd like to know whether there's a similar function in Python in general, or in Doc2Vec or NLTK: I'd like to drop the words that are present in more than 50% of the dataset, without vectorizing the text.
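
For reference, a minimal sketch of the scikit-learn behaviour I mean (toy documents; get_feature_names_out needs scikit-learn >= 1.0):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "this is new a puppy ate cheese see",
        "this is new a cat was found see",
        "this is new problems arise see",
    ]

    # max_df=0.5 ignores any term whose document frequency is above 50%
    vectorizer = TfidfVectorizer(max_df=0.5)
    X = vectorizer.fit_transform(docs)

    # 'this', 'is', 'new' and 'see' occur in every document, so they are
    # excluded from the learned vocabulary
    print(vectorizer.get_feature_names_out())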

For example, I'd like to turn a dataframe like:

0 | This is new: A puppy ate cheese! See?
1 | This is new: A cat was found. See?
2 | This is new: Problems arise. See?

into an output like this:

0 | puppy ate cheese
1 | cat was found
2 | problems arise

I've already done the lowercasing and the stopword removal; now I'd just like to remove the most frequent words. I'd also like to store this information, because new input might come in, and I'd like to remove from it the same frequent words that I found to be frequent in the original corpus.


Solution

  • You could do this as a preprocessing step:

    import nltk

    # tokenize the corpus and build a frequency distribution over
    # the lowercased words
    allWords = nltk.tokenize.word_tokenize(text)
    allWordDist = nltk.FreqDist(w.lower() for w in allWords)

    followed by

    # most_common() returns a list of (word, count) pairs, not a dict,
    # so take the first element of each pair instead of calling .keys()
    mostCommon = [word for word, count in allWordDist.most_common(10)]

    If you look into allWordDist.items(), I think you will find everything you need.
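
  • If you need the original requirement exactly (drop words that appear in more than 50% of the documents, and reuse that word list on new input), here is a sketch building on the same FreqDist idea; the helper names and the 0.5 threshold are illustrative, not a standard API:

    import nltk

    # requires the 'punkt' tokenizer data: nltk.download('punkt')

    def build_frequent_word_set(docs, max_doc_frac=0.5):
        """Collect words appearing in more than max_doc_frac of the documents."""
        doc_freq = nltk.FreqDist()
        for doc in docs:
            # count each word once per document: document frequency,
            # not raw corpus frequency
            doc_freq.update(set(nltk.tokenize.word_tokenize(doc.lower())))
        threshold = max_doc_frac * len(docs)
        return {word for word, df in doc_freq.items() if df > threshold}

    def drop_frequent_words(doc, frequent_words):
        tokens = nltk.tokenize.word_tokenize(doc.lower())
        return " ".join(t for t in tokens if t not in frequent_words)

    docs = [
        "this is new a puppy ate cheese see",
        "this is new a cat was found see",
        "this is new problems arise see",
    ]
    frequent = build_frequent_word_set(docs)  # persist this set for new input
    print([drop_frequent_words(d, frequent) for d in docs])
    # ['puppy ate cheese', 'cat was found', 'problems arise']

    Because the frequent-word set is computed once and stored, you can apply drop_frequent_words to new documents later without recomputing it.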