I'm trying to get the tf-idf for a set of documents using the following code:
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ['iADV díltudNOUN iADV gaibidVERB gabálNOUN', 'iADV díthNOUN dérnumNOUN iADP foileNOUN', ...]
vocab = ['aADP', 'aDET', 'aPRON', 'achtSCONJ', 'amalSCONJ', 'arADP', 'arPRON', ...]
vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r"(?u)\b[\wáéíóúↄḟṁṅæǽ⁊ɫ֊̃]+\b", vocabulary=vocab)
vectors = vectorizer.fit_transform(documents)
print(vectors)
When I do this, the matrix is empty. If I try print([vectors]) instead, I can see the shape of the matrix, but there is no data in it:
[<42x79 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Row format>]
Weirdly, when I remove the vocabulary=vocab argument, I get the tf-idf for all of the words in the documents, though I really don't want it for all words:
vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r"(?u)\b[\wáéíóúↄḟṁṅæǽ⁊ɫ֊̃]+\b")
vectors = vectorizer.fit_transform(documents)
print(vectors)
(0, 564) 0.09058331497564333
(0, 313) 0.09058331497564333
(0, 93) 0.08155482537999634
(0, 165) 0.06268804803234075
(0, 169) 0.09058331497564333
...
What is causing my matrix to be empty when I pass the vocabulary argument? Is there something wrong with my token_pattern?
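The problem comes from the lowercase parameter, which defaults to True. All of your text is converted to lowercase before it is tokenized, so a token such as 'iADV' becomes 'iadv' and never matches the mixed-case entries in your vocabulary. If you convert your vocabulary to lowercase, it will work: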
vocab=[v.lower() for v in vocab]
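To see the mismatch directly, you can inspect what the vectorizer's analyzer produces. This is only a minimal sketch: the document and vocabulary are shortened stand-ins for the ones in the question, and it uses the default token_pattern rather than the custom one, which is an assumption made to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Shortened stand-ins for the question's data (illustration only).
documents = ['iADV díltudNOUN iADV gaibidVERB gabálNOUN']
vocab = ['iADV', 'díltudNOUN', 'gaibidVERB', 'gabálNOUN']

# lowercase is left at its default (True), as in the question.
vectorizer = TfidfVectorizer(analyzer='word', vocabulary=vocab)

# build_analyzer() returns the callable that preprocesses and tokenizes text,
# so it shows the tokens that are later looked up in the vocabulary.
analyzer = vectorizer.build_analyzer()
tokens = analyzer(documents[0])
print(tokens)

# Tokens that also appear in the mixed-case vocabulary.
matches = [t for t in tokens if t in vocab]
print(matches)
```

Every token comes back lowercased, so none of them is found in the mixed-case vocabulary, which is why the resulting matrix has no stored elements.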
You can also set the lowercase parameter to False, which keeps the tokens in their original case so they match your vocabulary as written.
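Here is a small sketch of the lowercase=False option. The two documents are the ones shown in the question, but the vocabulary is a short hypothetical subset, and the default token_pattern is used instead of the custom one, so treat it as an illustration rather than a drop-in replacement.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    'iADV díltudNOUN iADV gaibidVERB gabálNOUN',
    'iADV díthNOUN dérnumNOUN iADP foileNOUN',
]
# Hypothetical subset of the vocabulary, kept in its original mixed case.
vocab = ['iADV', 'iADP', 'díltudNOUN', 'foileNOUN']

# lowercase=False stops the vectorizer from lowercasing the text,
# so tokens such as 'iADV' match the vocabulary entries exactly.
vectorizer = TfidfVectorizer(analyzer='word', vocabulary=vocab, lowercase=False)
vectors = vectorizer.fit_transform(documents)

# Shape is (number of documents, vocabulary size); nnz counts stored values.
print(vectors.shape)
print(vectors.nnz)
```

With this setting the matrix is no longer empty, and its columns follow the order of the vocabulary you passed in.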