
How do I ignore short documents with Sklearn?


I am using Sklearn's CountVectorizer() to transform my text documents into an article-word co-occurrence matrix. It has worked well; however, I want to exclude the rows corresponding to documents that contain fewer than k words.

I have attempted to do this with a simple for loop; however, because I'm working with sparse arrays, it doesn't work. It isn't the most elegant code either. There must be a better way!

The code below computes the co-occurrence matrix X; the loop then cycles through each row and counts the rows that contain fewer than k words.

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(doc)

for i in range(len(data)):
    if sum(X[i,:])<k:
        count += 1

Solution

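  • You could make use of getnnz, which with axis=1 returns the number of non-zero entries in each row of the sparse matrix. Use that as a boolean mask to keep only the rows you want, as shown below: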

    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(doc)
    k = 100
    X_reduced = X[X.getnnz(axis=1)>=k]
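
One caveat worth checking against your definition of "k words": getnnz(axis=1) counts the non-zero columns in each row, i.e. the number of distinct vocabulary terms in a document, not how many words it contains in total. If you want the total, summing each row is probably closer to what you need. Below is a minimal sketch of both masks. The toy `docs` list and `k = 3` are made up for illustration, and it assumes fit_transform returns a SciPy CSR matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not from the question); replace with your own documents.
docs = [
    "the cat sat on the mat",   # 6 tokens, 5 distinct terms
    "hello world",              # 2 tokens, 2 distinct terms
    "spam spam spam spam",      # 4 tokens, 1 distinct term
]
k = 3  # illustrative threshold

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)

# Number of distinct vocabulary terms per document (non-zero entries per row).
distinct_terms = X.getnnz(axis=1)

# Total number of counted tokens per document (row sums), flattened to 1-D.
total_tokens = np.asarray(X.sum(axis=1)).ravel()

# Boolean masks over rows; index the sparse matrix with either one.
keep_distinct = distinct_terms >= k
keep_total = total_tokens >= k

X_by_distinct = X[keep_distinct]
X_by_total = X[keep_total]

# The same mask can filter the original texts so they stay aligned with the rows.
kept_docs = [d for d, keep in zip(docs, keep_total) if keep]
```

Note that both counts only include tokens that CountVectorizer actually keeps, so settings such as stop_words, min_df, or the default token pattern (which drops one-character tokens) will change the numbers.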