How to use the scikit-learn CountVectorizer?


I have a set of words, and I need to check whether each of them is present in a set of documents.

WordList = [w1, w2, ..., wn]

I also have a list of documents in which I need to check whether these words are present.

How can I use scikit-learn's CountVectorizer so that the features of the term-document matrix are only the words from WordList, with each row representing one document and each column holding the number of times the corresponding word appears in that document?


Solution

  • OK, I figured it out. The code is given below:

    from sklearn.feature_extraction.text import CountVectorizer

    # Count how many times each word (unigram) appears in each document.
    vectorizer = CountVectorizer(input='content', binary=False, ngram_range=(1, 1))
    # First, learn the vocabulary from WordList (a list of strings).
    vectorizer = vectorizer.fit(WordList)
    # Now transform the documents; Document_List is a list of strings, one per document.
    tfMatrix = vectorizer.transform(Document_List).toarray()
    

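    This outputs a term-document matrix whose features are only the words from WordList: one row per document and one column per word. Note that because the vocabulary is learned with fit, the columns follow the alphabetically sorted vocabulary, not the order of WordList.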
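
As a possible alternative to fitting on the word list, CountVectorizer also accepts a vocabulary argument, which fixes the columns to the given words in the given order. Below is a minimal sketch of that variant. The names word_list and document_list and the sample sentences are made up for illustration and are not from the question. It assumes the words are lowercase and at least two characters long, because the default settings lowercase the documents and ignore single-character tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample data, for illustration only.
word_list = ["apple", "banana", "cherry"]
document_list = [
    "Apple pie with apple sauce",
    "banana bread and cherry jam",
    "no fruit here",
]

# vocabulary= fixes the feature columns to word_list, in that order,
# so no separate fit step is needed before transform.
vectorizer = CountVectorizer(vocabulary=word_list)
tf_matrix = vectorizer.transform(document_list).toarray()

# Each row is a document; each column counts one word from word_list.
print(tf_matrix)
# [[2 0 0]
#  [0 1 1]
#  [0 0 0]]

# If only presence or absence matters, compare the counts against zero.
present = tf_matrix > 0
print(present[0])  # [ True False False]
```

Words in the documents that are not in word_list are simply ignored, so the matrix always has exactly len(word_list) columns. Passing binary=True to CountVectorizer should give a 0/1 matrix directly instead of counts.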