scikit-learn, text-mining, tf-idf, tfidfvectorizer, countvectorizer

Searching for a word group with TfidfVectorizer


I'm using sklearn to compute the TF-IDF for a given keyword list. It works fine for single words, but it doesn't count word groups such as "car manufacturers". How can I fix this? Should I use a different module?

Below are the first lines of my code so you can see which modules I used. Thanks in advance!

import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from pathlib import Path


# root dir
root = '/Users/Tom/PycharmProjects/TextMining/'
#
words_to_find = ['vehicle', 'automotive', 'car manufacturers']
# tf_idf file writing
wrote_tf_idf_header = False
tf_idf_file_idx = 0
#
vectorizer_tf_idf = TfidfVectorizer(max_df=.65, min_df=1, stop_words=None, use_idf=True, norm=None, vocabulary=words_to_find)
vectorizer_cnt = CountVectorizer(stop_words=None, vocabulary=words_to_find)

Solution

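  • By default the vectorizers only produce single-word tokens (ngram_range=(1, 1)), so a two-word vocabulary entry such as "car manufacturers" never matches. Pass ngram_range=(1, 2) to CountVectorizer so that two-word groups are generated as well; TfidfVectorizer accepts the same parameter. The documentation describes the parameter and includes an example: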

    https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

    Applied to your code, with a one-line sample text standing in for your files:

    import numpy as np
    import os
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from pathlib import Path
    
    
    # root dir (replaced here by a one-document sample corpus for the demo)
    # root = '/Users/Tom/PycharmProjects/TextMining/'
    root = ['car manufacturers vehicle vehicales vehicle automotive car house manufacturers']
    #
    words_to_find = ['vehicle', 'automotive', 'car manufacturers']
    # tf_idf file writing
    wrote_tf_idf_header = False
    tf_idf_file_idx = 0
    #
    vectorizer_tf_idf = TfidfVectorizer(max_df=.65, min_df=1, stop_words=None, use_idf=True, norm=None, vocabulary=words_to_find, ngram_range=(1, 2))
    vectorizer_cnt = CountVectorizer(stop_words=None, vocabulary=words_to_find, ngram_range=(1,2))
    x = vectorizer_cnt.fit_transform(root)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    print(vectorizer_cnt.get_feature_names_out().tolist())
    print(x.toarray())
    

    Output:

    ['vehicle', 'automotive', 'car manufacturers']
    [[2 1 1]]
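
As a minimal sketch of the difference, assuming a made-up three-sentence corpus (the `docs` list is hypothetical and not taken from the question), the snippet below compares the default unigram setting with ngram_range=(1, 2) and then applies the same setting to TfidfVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus for illustration; replace with your own documents.
docs = [
    "car manufacturers build every vehicle",
    "the automotive industry and car manufacturers",
    "a house is not a vehicle",
]
words_to_find = ["vehicle", "automotive", "car manufacturers"]

# Default ngram_range=(1, 1): only single words are generated, so the
# two-word vocabulary entry can never match and its column stays at zero.
unigram_counts = CountVectorizer(vocabulary=words_to_find).fit_transform(docs)
print(unigram_counts.toarray())
# [[1 0 0]
#  [0 1 0]
#  [1 0 0]]

# ngram_range=(1, 2): single words and adjacent word pairs are generated.
bigram_counts = CountVectorizer(
    vocabulary=words_to_find, ngram_range=(1, 2)
).fit_transform(docs)
print(bigram_counts.toarray())
# [[1 0 1]
#  [0 1 1]
#  [1 0 0]]

# The same parameter on TfidfVectorizer gives a TF-IDF weight to the word group.
tfidf = TfidfVectorizer(vocabulary=words_to_find, ngram_range=(1, 2), norm=None)
weights = tfidf.fit_transform(docs)
print(weights.toarray())
```

The columns follow the order of `words_to_find`, so the third column is the one for "car manufacturers". Matching is exact on the generated tokens, which is why a misspelling or a plural form in the text is not counted.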