I have a standard word2vec output which is a .txt file formatted as follows:
[number of words] [dimension (300)]
word1 [300 float numbers separated by spaces]
word2 ...
Now I want to read at most M word representations out of this file. A simple way is to loop over the first M+1 lines of the file and store the M vectors in a numpy array. But this is super slow; is there a faster way?
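For reference, a minimal version of that loop might look like this (a sketch assuming the whitespace-delimited format above; the function name is illustrative):

import numpy as np

def read_vectors(path, M):
    """Read at most M word vectors from a word2vec-format text file."""
    with open(path, encoding='utf-8') as f:
        vocab_size, dim = map(int, f.readline().split())  # header line
        n = min(M, vocab_size)
        words = []
        vectors = np.empty((n, dim), dtype=np.float32)  # preallocate once
        for i in range(n):
            tokens = f.readline().split()
            words.append(tokens[0])
            vectors[i] = np.asarray(tokens[1:], dtype=np.float32)
    return words, vectors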
What do you mean, "is super slow"? Compared to what?
Because it's a plain text format, there's no way around reading the file line by line, parsing the floats, and assigning them into a usable structure. But you might be doing that very inefficiently; without seeing your code, it's hard to tell.
The gensim library in Python includes classes for working with word-vectors in this format, and its loading routine accepts an optional limit argument for reading just a certain number of vectors from the front of a file. For example, this will read the first 1000 vectors from a file named word-vectors.txt:
from gensim.models import KeyedVectors

word_vecs = KeyedVectors.load_word2vec_format('word-vectors.txt',
                                              binary=False,
                                              limit=1000)
I've never noticed it being a particularly slow operation, even when loading something like the 3GB+ set of word-vectors Google released. (If it does seem super slow, you may have insufficient RAM, and the attempted load may be relying on virtual-memory paging, which you never want to happen with a randomly accessed data structure like this.)
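Once loaded, lookups behave like a dictionary of numpy arrays. For instance (assuming the token 'king' appears among the first 1000 entries):

vec = word_vecs['king']                        # numpy array of shape (300,)
print(word_vecs.most_similar('king', topn=5))  # nearest neighbors by cosine similarity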
If you then save the vectors in gensim's native format via .save(), and if the constituent numpy arrays are large enough to be saved as separate files, then you'd have the option of using gensim's native .load() with the optional mmap='r' argument. This would entirely skip any parsing of the raw on-disk numpy arrays, just memory-mapping them into addressable space, making .load() complete very quickly. Then, as ranges of the array are accessed, they'd be paged into RAM. You'd still pay the cost of reading all the data from disk, but incrementally, as needed, rather than in one big batch up front.
For example...
word_vecs.save('word-vectors.gensim')
...then later...
word_vecs2 = KeyedVectors.load('word-vectors.gensim', mmap='r')
(There's no limit option for the native .load().)
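If you need both trimming and fast reloads, one workaround (a sketch; file names are illustrative) is to do the limiting once at parse time, then persist the already-trimmed set natively:

from gensim.models import KeyedVectors

# Parse only the first 1000 vectors from the raw text file...
subset = KeyedVectors.load_word2vec_format('word-vectors.txt',
                                           binary=False,
                                           limit=1000)
# ...and save them in gensim's native format once.
subset.save('word-vectors-1k.gensim')

# Later sessions reload quickly; if the arrays are large enough to be
# stored as separate .npy files, mmap='r' memory-maps them instead of parsing.
subset2 = KeyedVectors.load('word-vectors-1k.gensim', mmap='r')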