Tags: python, scikit-learn, tfidfvectorizer

Why is sklearn's TfidfVectorizer returning an empty matrix when I pass an argument for vocabulary, but not when I don't?


I'm trying to get the tf-idf for a set of documents using the following code:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ['iADV díltudNOUN iADV gaibidVERB gabálNOUN', 'iADV díthNOUN dérnumNOUN iADP foileNOUN', ...]
vocab = ['aADP', 'aDET', 'aPRON', 'achtSCONJ', 'amalSCONJ', 'arADP', 'arPRON', ...]

vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r"(?u)\b[\wáéíóúↄḟṁṅæǽ⁊ɫ֊̃]+\b", vocabulary=vocab)
vectors = vectorizer.fit_transform(documents)
print(vectors)

When I do this, the matrix is empty. If I print([vectors]) instead, I can see the shape of the matrix, but it contains no data:

[<42x79 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>]

Oddly, when I remove the vocabulary=vocab argument, I get tf-idf values for every word in the documents, although I don't want them for all words:

vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r"(?u)\b[\wáéíóúↄḟṁṅæǽ⁊ɫ֊̃]+\b")
vectors = vectorizer.fit_transform(documents)
print(vectors)

  (0, 564)  0.09058331497564333
  (0, 313)  0.09058331497564333
  (0, 93)   0.08155482537999634
  (0, 165)  0.06268804803234075
  (0, 169)  0.09058331497564333
  ...

What is causing my matrix to be empty when I pass the vocabulary argument? Is there something wrong with my token_pattern?


Solution

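  • The problem comes from the parameter lowercase, which defaults to True. All of your text is transformed to lowercase before tokenization, but the entries in vocabulary are used exactly as given, so a token such as 'aadp' never matches the vocabulary entry 'aADP' and every count is zero. If you change your vocabulary to lowercase, it will work: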

    vocab=[v.lower() for v in vocab]
    

    Alternatively, set the parameter lowercase to False so that the documents keep their original case.
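
The mismatch described above can be reproduced without scikit-learn. The following is a minimal sketch using only the standard library; it is a simplified stand-in for the order of steps (lowercase, tokenize, look up in the vocabulary), not the library's actual implementation. The helper count_vocab_hits, the plain \w+ token pattern, and the three-word vocabulary are assumptions made up for illustration.

```python
import re

# Simplified stand-in for the vectorizer's analysis steps (NOT scikit-learn's code):
# optional lowercasing, regex tokenization, then lookup in a fixed vocabulary.
TOKEN_PATTERN = re.compile(r"(?u)\b\w+\b")


def count_vocab_hits(documents, vocabulary, lowercase=True):
    """Return how many tokens across all documents appear in the vocabulary."""
    vocab = set(vocabulary)  # entries are used exactly as given, never lowercased
    hits = 0
    for doc in documents:
        text = doc.lower() if lowercase else doc
        hits += sum(token in vocab for token in TOKEN_PATTERN.findall(text))
    return hits


documents = [
    "iADV díltudNOUN iADV gaibidVERB gabálNOUN",
    "iADV díthNOUN dérnumNOUN iADP foileNOUN",
]
vocab = ["iADV", "iADP", "foileNOUN"]  # hypothetical subset, for illustration only

default_hits = count_vocab_hits(documents, vocab)                    # lowercased text, mixed-case vocab
cased_hits = count_vocab_hits(documents, vocab, lowercase=False)     # original case on both sides
lowered_vocab_hits = count_vocab_hits(documents, [v.lower() for v in vocab])  # lowercase on both sides

print(default_hits, cased_hits, lowered_vocab_hits)  # prints: 0 5 5
```

Both fixes recover the same matches here. They differ in what the feature names look like afterwards: lowercasing the vocabulary gives names such as 'iadv', while lowercase=False keeps the part-of-speech suffix in capitals.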