scikit-learn text-mining tf-idf tfidfvectorizer countvectorizer

Searching for a word group with TfidfVectorizer

I'm using sklearn to compute the TF-IDF for a given keyword list. It works fine for single words, but it doesn't count word groups such as "car manufacturers". How can I fix this? Should I use a different module?

Below are the first lines of my code, so you can see which modules I use. Thanks in advance!

import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from pathlib import Path


# root dir
root = '/Users/Tom/PycharmProjects/TextMining/'
#
words_to_find = ['vehicle', 'automotive', 'car manufacturers']
# tf_idf file writing
wrote_tf_idf_header = False
tf_idf_file_idx = 0
#
vectorizer_tf_idf = TfidfVectorizer(max_df=.65, min_df=1, stop_words=None, use_idf=True, norm=None, vocabulary=words_to_find)
vectorizer_cnt = CountVectorizer(stop_words=None, vocabulary=words_to_find)

Solution

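By default both vectorizers use ngram_range=(1, 1), so the analyzer only produces single words and a two-word vocabulary entry such as "car manufacturers" never matches. Pass ngram_range=(1, 2) to CountVectorizer (and to TfidfVectorizer) so that bigrams are generated as well. The parameter is described, with an example, in the documentation: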

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
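
To see why the word group is missed, it can help to look at the terms the analyzer actually produces. The snippet below is a minimal sketch; the sample sentence is made up for illustration and is not taken from the question's data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample sentence, for illustration only
text = "car manufacturers build vehicles"

# build_analyzer() returns the callable that turns a raw string into terms
unigram_analyzer = CountVectorizer(ngram_range=(1, 1)).build_analyzer()
bigram_analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()

unigram_terms = unigram_analyzer(text)
bigram_terms = bigram_analyzer(text)

print(unigram_terms)  # single words only
print(bigram_terms)   # single words plus two-word groups

# The vocabulary entry can only be counted if the analyzer emits it
print("car manufacturers" in unigram_terms)
print("car manufacturers" in bigram_terms)
```

With the default range the term "car manufacturers" is never generated, so its column stays at zero no matter what the vocabulary contains.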

Applied to your code, with a sample string standing in for the documents:

import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from pathlib import Path


# root dir
# root = '/Users/Tom/PycharmProjects/TextMining/'
root = ['car manufacturers vehicle vehicales vehicle automotive car house manufacturers']
#
words_to_find = ['vehicle', 'automotive', 'car manufacturers']
# tf_idf file writing
wrote_tf_idf_header = False
tf_idf_file_idx = 0
#
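# ngram_range=(1, 2) makes the analyzer emit bigrams, so 'car manufacturers' can match
vectorizer_tf_idf = TfidfVectorizer(max_df=.65, min_df=1, stop_words=None, use_idf=True, norm=None, vocabulary=words_to_find, ngram_range=(1, 2))
vectorizer_cnt = CountVectorizer(stop_words=None, vocabulary=words_to_find, ngram_range=(1, 2))
x = vectorizer_cnt.fit_transform(root)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
print(vectorizer_cnt.get_feature_names_out().tolist())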
print(x.toarray())

Output:

['vehicle', 'automotive', 'car manufacturers']
[[2 1 1]]
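
Since the question is about TF-IDF, here is a rough sketch of the same idea with TfidfVectorizer. The three documents are a hypothetical mini-corpus invented for this example; replace them with the contents of your own files.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, made up for illustration only
docs = [
    "car manufacturers build a vehicle",
    "the automotive sector and car manufacturers",
    "a house is not a vehicle",
]
words_to_find = ["vehicle", "automotive", "car manufacturers"]

# With a list vocabulary, the columns follow the order of words_to_find
vectorizer = TfidfVectorizer(vocabulary=words_to_find, ngram_range=(1, 2), norm=None)
tfidf = vectorizer.fit_transform(docs)
arr = tfidf.toarray()

# One line per document: the TF-IDF weight of each keyword
for doc, row in zip(docs, arr):
    print(doc, dict(zip(words_to_find, row.round(3))))
```

The third document contains neither "car" followed by "manufacturers" nor "automotive", so those two columns are zero for it, while the first two documents get a non-zero weight for "car manufacturers".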