I'm using sklearn to compute TF-IDF for a given keyword list. It works fine, except that it doesn't count word groups such as "car manufacturers". How can I fix this? Should I use a different module?
Below are the first lines of my code so you can see which modules I use. Thanks in advance!
import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from pathlib import Path
# root dir
root = '/Users/Tom/PycharmProjects/TextMining/'
#
words_to_find = ['vehicle', 'automotive', 'car manufacturers']
# tf_idf file writing
wrote_tf_idf_header = False
tf_idf_file_idx = 0
#
vectorizer_tf_idf = TfidfVectorizer(max_df=.65, min_df=1, stop_words=None, use_idf=True, norm=None, vocabulary=words_to_find)
vectorizer_cnt = CountVectorizer(stop_words=None, vocabulary=words_to_find)
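You need to pass the ngram_range parameter to CountVectorizer (and likewise to TfidfVectorizer) to get the result you are expecting. By default ngram_range=(1, 1), so only single words are extracted and a two-word vocabulary entry such as "car manufacturers" never matches. The scikit-learn documentation for CountVectorizer describes the parameter with an example.
You can fix it like this: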
import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from pathlib import Path
# root dir (replaced here by a one-document sample corpus so the example is self-contained)
# root = '/Users/Tom/PycharmProjects/TextMining/'
root = ['car manufacturers vehicle vehicales vehicle automotive car house manufacturers']
#
words_to_find = ['vehicle', 'automotive', 'car manufacturers']
# tf_idf file writing
wrote_tf_idf_header = False
tf_idf_file_idx = 0
#
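# ngram_range=(1, 2) extracts single words and two-word groups, so 'car manufacturers' can match
vectorizer_tf_idf = TfidfVectorizer(max_df=.65, min_df=1, stop_words=None, use_idf=True, norm=None, vocabulary=words_to_find, ngram_range=(1, 2))
vectorizer_cnt = CountVectorizer(stop_words=None, vocabulary=words_to_find, ngram_range=(1, 2))
x = vectorizer_cnt.fit_transform(root)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
print(list(vectorizer_cnt.get_feature_names_out()))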
print(x.toarray())
Output:
['vehicle', 'automotive', 'car manufacturers']
[[2 1 1]]
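Since the question was about TF-IDF rather than raw counts, here is a minimal sketch of the same idea with TfidfVectorizer. The two sample documents in `docs` are made up for illustration and are not from the original code. The call to `build_analyzer()` is only there to show which tokens the vectorizer extracts once ngram_range=(1, 2) is set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents, invented for this illustration
docs = [
    'car manufacturers vehicle vehicle automotive',
    'car house manufacturers',
]
words_to_find = ['vehicle', 'automotive', 'car manufacturers']

# Same settings as in the question, plus ngram_range so that
# two-word groups are extracted alongside single words
vectorizer_tf_idf = TfidfVectorizer(
    use_idf=True,
    norm=None,
    vocabulary=words_to_find,
    ngram_range=(1, 2),
)

# The analyzer shows what a document is split into before matching
analyzer = vectorizer_tf_idf.build_analyzer()
tokens = analyzer(docs[0])
print(tokens)

# Rows are documents, columns follow the order of words_to_find
tfidf = vectorizer_tf_idf.fit_transform(docs).toarray()
print(tfidf)
```

In the second document "car" and "manufacturers" are separated by "house", so the bigram "car manufacturers" is not produced there and its column stays at zero for that row.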