Tags: python, gensim, doc2vec

I get more vectors than my document count - gensim doc2vec


I have protein sequences and want to run doc2vec on them. My goal is to have one vector for each sentence/sequence.

I have 1612 sentences/sequences and 30 classes, so the labels are not unique and many documents share the same label.

So when I first tried doc2vec, it gave me just 30 vectors, which is the number of unique labels. Then I decided to use multiple tags so that I would get a vector for each sentence.

When I did this, I ended up with more vectors than sentences. Any explanation of what might have gone wrong?

Screenshot of my data

Screenshot of corpus

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged = data.apply(lambda r: TaggedDocument(words=r["A"], tags=[r.label, r.id]), axis=1)

print(len(tagged))
# 1612

sents = tagged.values

model = Doc2Vec(sents, size=5, window=5, iter=20, min_count=0)

sents.shape
# (1612,)

model.docvecs.vectors_docs.shape
# (1643, 5)



Solution

  • The number of tags a Doc2Vec model will learn is equal to the number of unique tags you've provided. You provided 1612 different r.id values, and 30 different r.label values, hence a total number of tags larger than just your document count.

    (I suspect your r.id values are plain integers, but start at 1. If you use plain integers, rather than strings, as tags, Doc2Vec uses those ints directly as indexes into its internal vector array, so any int index smaller than the ones you use, such as 0, also gets allocated. Hence your count of 1612 + 30 + 1 = 1643 total known tags, because space was also allocated for tag 0. A small sketch after this answer's bullet points illustrates the allocation.)

    So that explains your tag count, and nothing is necessarily wrong. Beware, however:

    • Your dataset is very small: most published work uses 10s-of-thousands to millions of documents. You can sometimes still eke out useful vectors by using smaller vectors or more training epochs, but mainly Doc2Vec and similar algorithms need more data to work best. (Still: a vector size=5 is quite tiny!)

    • With small data especially, the simple PV-DBOW mode (dm=0) is often a fast-training top performer. (But note: it doesn't train word-vectors using context windows unless you add the dbow_words=1 option, which in turn slows it down with that extra word-vector training.)

    • Whether you should be using the labels as document-tags at all is not certain - the classic use of Doc2Vec just gives each doc a unique ID, then lets downstream steps learn its relations to other things. Mixing in other known document-level labels can sometimes help or hurt, depending on your data and ultimate goals. (More tags can, to an extent, "dilute" whatever is learned across a larger model.) The configuration sketch at the end of this answer shows the unique-ID-only approach.

    • At least in natural language, retaining words that appear only once or a few times can often harm overall vector quality. There are too few occurrences to model them well, and since, by Zipf's law, there will be many such words, they can wind up interfering a lot with the training of other entities. So the default min_count=5 (or even higher with larger datasets) often helps overall quality, and you shouldn't assume that simply retaining more data, with min_count=0, necessarily helps.
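
Here is a minimal sketch of the int-tag allocation described above. The two documents and their tag values are made up purely for illustration, and it assumes the same gensim 3.x-era API as the question (in gensim 4+ the parameters are vector_size/epochs and the vectors live in model.dv):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# two tiny documents: each gets a string class label plus a plain-int id,
# with int values starting at 1, mimicking the question's [r.label, r.id] tags
docs = [
    TaggedDocument(words=["a", "b"], tags=["classA", 1]),
    TaggedDocument(words=["c", "d"], tags=["classB", 2]),
]

model = Doc2Vec(docs, size=5, window=2, iter=5, min_count=0)

# int tags are used as direct indexes, so index 0 is allocated even though
# it was never used: 2 string labels + int slots 0..2 = 5 rows expected
print(model.docvecs.vectors_docs.shape)  # expected: (5, 5)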
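
And here is a hypothetical reconfiguration along the lines of the suggestions above: one unique string tag per sentence, PV-DBOW mode, more training passes, and a non-zero min_count. The name data, the column "A", and r.id come from the question; the specific parameter values are only illustrative, and gensim 4+ spells the parameters vector_size and epochs:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one unique *string* tag per sentence -> exactly one vector per document
tagged = data.apply(lambda r: TaggedDocument(words=r["A"], tags=[str(r.id)]), axis=1)

model = Doc2Vec(
    tagged.values,
    dm=0,          # PV-DBOW, often a fast-training top performer on small data
    dbow_words=0,  # set to 1 if you also want word-vectors (slower)
    size=50,       # still modest, but less tiny than 5
    iter=100,      # more passes to compensate for the small corpus
    min_count=5,   # drop ultra-rare tokens rather than keeping everything
)

print(model.docvecs.vectors_docs.shape)  # expected: (1612, 50)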