python, sqlite, word-embedding, file-conversion

Incomplete word embedding model conversion with plasticityai/magnitude


I want to convert the word embedding model Numberbatch 19.08 to the .magnitude format used in plasticityai/magnitude. Because I want to be able to use approximate nearest neighbor algorithms, I run the converter with the -a flag:

python -m pymagnitude.converter -i numberbatch.txt -o numberbatch.magnitude -a

The unpacked numberbatch.txt is about 20 GB in size. I am using Windows 10.

At first, the conversion seems to run fine (for some hours), showing progress like

Writing vectors... (this may take some time)

1% completed ... 99% completed

then

Committing written vectors... (this may take some time)

and finally

Creating search index... (this may take some time)

Creating spatial search index for dimension 2 (it has high entropy)... (this may take some time)

Creating approximate nearest neighbors index... (this may take some time)

However, I never get a final message that the conversion is complete. Rather, the program stops without any further messages.

At that stage I am left with the following three files in the target folder:

    15.891.668.992 numberbatch.magnitude.tmp
           557.056 numberbatch.magnitude.tmp-shm
       281.227.112 numberbatch.magnitude.tmp-wal

The intended end result, numberbatch.magnitude, is missing.

Any hint about what might have gone wrong would be much appreciated. Is there any way to complete the conversion using the three tmp files?


Solution

  • I found a partial answer to my own question in a closed issue of the plasticityai/magnitude project:

    It seems that pymagnitude.converter cannot handle vector files in the multi-GB range when used with the -a flag, which builds the approximate nearest neighbors index. The issue speculated that the problem lies in the underlying Annoy library, but the precise cause was never fully resolved.

    For now, the provisional remedy is to omit the -a flag, at the cost of not getting the approximate nearest neighbors index:

    python -m pymagnitude.converter -i numberbatch.txt -o numberbatch.magnitude
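
On the second part of the question (whether the three tmp files are of any use): as far as I know, a .magnitude file is a SQLite database, and the -shm and -wal files are the sidecar files SQLite creates in write-ahead-log mode. Under that assumption, the following minimal sketch uses Python's standard sqlite3 module to list the tables in the tmp file and count their rows, which at least shows whether the vectors were committed. The helper name summarize_sqlite_file is hypothetical and not part of pymagnitude, and nothing here completes the conversion. It is safer to work on a copy of all three files, since opening the database may merge the -wal file into the main file.

```python
import sqlite3


def summarize_sqlite_file(path):
    """Return {table_name: row_count} for every table in a SQLite file.

    A row count of None means the table could not be read, for example
    because it is a virtual table whose module is not available in the
    stock sqlite3 build.
    """
    conn = sqlite3.connect(path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        ]
        counts = {}
        for name in names:
            # Quote the identifier so unusual table names do not break the query.
            quoted = '"' + name.replace('"', '""') + '"'
            try:
                counts[name] = conn.execute(
                    "SELECT COUNT(*) FROM " + quoted
                ).fetchone()[0]
            except sqlite3.Error:
                counts[name] = None
        return counts
    finally:
        conn.close()


# Example call (assumes a copy of the tmp files in the current folder):
# print(summarize_sqlite_file("numberbatch.magnitude.tmp"))
```

If the table holding the vectors reports roughly as many rows as there are lines in numberbatch.txt, the vectors themselves were written and only the index step failed. Counting rows in a file of about 15 GB can take a while.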