
Gensim tagging documents with big numbers


I want to label my documents with tags that map to the id attribute in my database. The ids can be large numbers; for example, documents[0] is:

TaggedDocument(words=['blabla', 'request'], tags=[225616076])

For some reason, build_vocab fails. Although I have only 33382 unique ids/tags, their values are large, and gensim reports '225616077 tags' in the log:

2018-07-30 12:07:59,271 : INFO : collecting all words and their counts
2018-07-30 12:07:59,273 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2018-07-30 12:07:59,330 : INFO : PROGRESS: at example #1000, processed 7974 words (314086/s), 1975 word types, 225616077 tags
2018-07-30 12:07:59,343 : INFO : PROGRESS: at example #2000, processed 15882 words (701054/s), 2794 word types, 225616077 tags
...
2018-07-30 12:14:56,454 : INFO : estimated required memory for 6765 words and 20 dimensions: 19793760900 bytes
2018-07-30 12:14:56,457 : INFO : resetting layer weights

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
in <module>()
----> 1 model.build_vocab(documents)

How can I solve this problem? I do not want to renumber my documents starting from 0 and then map those indexes back to the real ids (wasted compute time). I also tried tagging with strings (so documents[0] is TaggedDocument(words=['blabla', 'request'], tags=['225616076'])), but that does not work either.

I am inspecting gensim's code but cannot find a solution on my own.


Solution

  • If you are using plain python int values as doc-tags, then the code assumes you want these to also be the raw int indexes into the underlying vector-array – and a vector-array large enough to hold your largest index will be allocated – even if many lower numbers go unused.

    This is an optimization to allow the code to avoid building the usual tag-to-index mapping, for those people who have neatly identified texts, numbered from 0 up.

    If your IDs aren't contiguous starting from 0, and can't easily be made to work that way, you can use string tags, which the code will recognize need to be mapped to unique index positions, and only a vector-array exactly the right size will be allocated.

    For example, your documents[0] would then be:

    TaggedDocument(words=['blabla', 'request'], tags=[str(225616076)])
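To see the scale difference, here is a small self-contained sketch (no gensim needed to run it) that reproduces the arithmetic behind the memory estimate in the log and shows the string-tag conversion. The `TaggedDocument` below is a local namedtuple stand-in with the same `words`/`tags` fields as gensim's, and the 20-dimension / 4-byte-float figures come from the log above:

```python
from collections import namedtuple

VECTOR_SIZE = 20      # dimensions, as in the question's log
BYTES_PER_FLOAT = 4   # float32 weights

def doctag_array_bytes(n_rows, vector_size=VECTOR_SIZE):
    """Rough size of the doc-vector array alone (ignores word vectors)."""
    return n_rows * vector_size * BYTES_PER_FLOAT

# With plain int tags, the array is sized by the largest index seen:
max_int_tag = 225616076
int_tag_bytes = doctag_array_bytes(max_int_tag + 1)

# With string tags, only the unique tags get rows:
unique_tags = 33382
str_tag_bytes = doctag_array_bytes(unique_tags)

print(f"int tags:    {int_tag_bytes / 1e9:.1f} GB")   # ~18 GB
print(f"string tags: {str_tag_bytes / 1e6:.2f} MB")   # ~2.7 MB

# The fix itself: convert every tag to a string before training.
# Stand-in for gensim.models.doc2vec.TaggedDocument (same field names):
TaggedDocument = namedtuple("TaggedDocument", "words tags")

documents = [TaggedDocument(words=['blabla', 'request'], tags=[225616076])]
documents = [
    TaggedDocument(words=d.words, tags=[str(t) for t in d.tags])
    for d in documents
]
```

After training on the string-tagged documents, you would look the vector up by the same string key (for example `model.dv['225616076']` in gensim 4.x, or `model.docvecs['225616076']` in 3.x).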