python scikit-learn n-gram countvectorizer

Obtaining feature vector from existing matrix


If I use scikit-learn to configure a CountVectorizer object and pass a matrix M of n sentences (of varying length) to the fit_transform function, I can, for example, obtain an n-gram representation F, like this:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1,
                             max_features=2000,
                             ngram_range=(2, 2),   # word bigrams only
                             analyzer="word")

F = vectorizer.fit_transform(M)

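This works well. F will now have the shape (n, 2000), one row per sentence and one column per feature, because I've set max_features to 2000 (assuming the sentences contain at least 2000 distinct bigrams).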

But let's say that I obtain one more sentence and would like to generate a vector that aligns with the features of F and has the same length (2000). Is this even possible, or do I need to keep the original matrix M, add the new sentence to it, and then regenerate all the features?


Solution

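  • If I understand what you are asking, you don't need to refit anything: you can transform additional sentences with the already-fitted vectorizer using vectorizer.transform(['New sentence here']). This reuses the vocabulary learned by fit_transform, so the resulting vector has the same columns as F; n-grams that were not seen during fitting are simply ignored.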
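
As a minimal sketch of the above: the toy corpus M and the new sentence below are made up for illustration, and the vectorizer settings are copied from the question. Fitting happens once; the new sentence is then mapped onto the same feature columns with transform.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the original sentences M.
M = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Same configuration as in the question.
vectorizer = CountVectorizer(min_df=1,
                             max_features=2000,
                             ngram_range=(2, 2),
                             analyzer="word")

# Learn the bigram vocabulary and build the matrix for M.
F = vectorizer.fit_transform(M)

# Reuse the fitted vocabulary for a sentence that was not in M.
# Bigrams absent from the vocabulary (here "the bird", "bird sat")
# are dropped rather than added as new columns.
new_vec = vectorizer.transform(["the bird sat on the log"])

print(F.shape)        # (number of sentences, number of features)
print(new_vec.shape)  # (1, number of features)
```

Note that the number of columns can be smaller than max_features when the corpus contains fewer distinct bigrams, as it does in this toy example. To keep using the same features later, keep the fitted vectorizer object around rather than the original sentences.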