
faster way of reading word2vec txt in python


I have a standard word2vec output, which is a .txt file formatted as follows:

[number of words] [dimension (300)]
word1 [300 float numbers separated by spaces]
word2 ...

Now I want to read at most M word representations out of this file. A simple way is to loop over the first M+1 lines of the file and store the M vectors into a numpy array, as sketched below. But this is super slow. Is there a faster way?
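
A minimal sketch of that simple approach (the file name vectors.txt and the limit M are just stand-ins):

import numpy as np

M = 100000  # stand-in: read at most this many word-vectors

words = []
with open('vectors.txt', encoding='utf8') as f:
    n_words, dim = map(int, f.readline().split())  # header: word count and dimension
    vectors = np.empty((min(M, n_words), dim), dtype=np.float32)
    for i in range(vectors.shape[0]):
        parts = f.readline().rstrip().split(' ')
        words.append(parts[0])                               # the word itself
        vectors[i] = np.array(parts[1:], dtype=np.float32)   # its 300 floats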


Solution

  • What do you mean, "is super slow"? Compared to what?

    Because the data is in a plain-text format, there's no way around reading the file line-by-line, parsing the floats, and assigning them into a usable structure. But you might be doing that very inefficiently – without seeing your code, it's hard to tell.

    The gensim library in Python includes classes for working with word-vectors in this format, and its routines include an optional limit argument for reading just a certain number of vectors from the front of a file. For example, this will read the first 1000 vectors from a file named word-vectors.txt:

    from gensim.models import KeyedVectors

    word_vecs = KeyedVectors.load_word2vec_format('word-vectors.txt', 
                                                  binary=False,
                                                  limit=1000)
    

    I've never noticed it being a particularly slow operation, even when loading something like the 3GB+ set of word-vectors Google released. (If it does seem super-slow, it could be that you have insufficient RAM and the attempted load is relying on virtual-memory paging – which you never want to happen with a random-access data structure like this.)

    If you then save the vectors in gensim's native format, via .save(), and if the constituent numpy arrays are large enough to be saved as separate files, then you'd have the option of using gensim's native .load() with the optional mmap='r' argument. This would entirely skip any parsing of the raw on-disk numpy arrays, just memory-mapping them into addressable space – making .load() complete very quickly. Then, as ranges of the array are accessed, they'd be paged into RAM. You'd still be paying the cost of reading-from-disk all the data – but incrementally, as needed, rather than in a big batch up front.

    For example...

    word_vecs.save('word-vectors.gensim')
    

    ...then later...

    word_vecs2 = KeyedVectors.load('word-vectors.gensim', mmap='r')
    

    (There's no 'limit' option for the native .load().)
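
    If you want a quick sanity check after either kind of load, something like this works (a minimal sketch; 'king' is just a stand-in for any word actually present in your vocabulary):

    vec = word_vecs2['king']   # a (300,)-shaped numpy array, paged in on access
    print(vec.shape)
    print(word_vecs2.most_similar('king', topn=5))
    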