python text scikit-learn tokenize

CountVectorizer in sklearn with only words above some minimum number of occurrences


I am using sklearn to train a logistic regression on some text data, using CountVectorizer to tokenize the data into unigrams and bigrams. I use a line of code like the one below:

vect = CountVectorizer(ngram_range=(1, 2), binary=True)

However, I'd like to limit myself to only including bigrams in my resultant sparse matrix that occur more than some threshold number of times (e.g., 50) across all of my data. Is there some way to specify this or make it happen?


Solution

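  • This can be done with CountVectorizer's min_df argument. Given an integer, it drops every term that appears in fewer than that many documents. Note that it counts the documents containing a term rather than the term's total occurrences, and that it applies to unigrams as well as bigrams:

    vect = CountVectorizer(ngram_range=(1, 2), binary=True, min_df=50)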
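To see what min_df actually filters on, here is a minimal sketch on a made-up four-document corpus (the docs list and the threshold of 2 are assumptions for illustration, not from the question). It also shows one possible way to filter on total occurrences instead: count once without binary=True, sum each column, and pass the surviving terms back through the vocabulary parameter.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, purely illustrative. The default tokenizer ignores
# one-character tokens, so "a" never enters the vocabulary.
docs = [
    "the cat sat",
    "the cat ran",
    "the dog sat",
    "a bird bird flew",
]
threshold = 2  # hypothetical threshold; the question uses 50

# Option 1: min_df keeps n-grams found in at least `threshold` documents.
vect = CountVectorizer(ngram_range=(1, 2), binary=True, min_df=threshold)
X = vect.fit_transform(docs)
kept = sorted(vect.vocabulary_)

# Option 2: filter on total occurrences across the corpus.
# First pass: raw (non-binary) counts for every unigram and bigram.
counter = CountVectorizer(ngram_range=(1, 2))
counts = counter.fit_transform(docs)
totals = np.asarray(counts.sum(axis=0)).ravel()  # one total per column

# Keep the terms whose summed count reaches the threshold.
frequent = sorted(
    term for term, idx in counter.vocabulary_.items()
    if totals[idx] >= threshold
)

# Second pass: vectorize again with the fixed, filtered vocabulary.
vect_total = CountVectorizer(
    ngram_range=(1, 2), binary=True, vocabulary=frequent
)
X_total = vect_total.fit_transform(docs)

print(kept)
print(frequent)
```

In this corpus "bird" occurs twice but only in one document, so it is dropped by min_df=2 and kept by the total-count filter. On real data the two criteria often agree closely, in which case min_df alone is the simpler choice.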