As part of a class project, I'm writing a word2vec implementation in Python and training it on a corpus of ~6GB. I'd like the code to be reasonably optimized so I don't have to leave my PC running for days.
Going through the C word2vec source code, I notice that each thread reads words from a file and spends time looking up the index of every word, eventually building a "sentence" of word indexes.
Wouldn't it be logical to translate the whole corpus once into a version containing the integer indexes of the corresponding words? That way, no time is lost on hash-table lookups during training, and the translation is a one-time expense.
I understand that for extremely large corpuses, you are effectively doubling the amount of space they take on disk, which you might want to avoid.
However, if you do have the memory, wouldn't this offer a noticeable increase in efficiency? Or am I just overestimating the impact of a table lookup?
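For concreteness, here's a minimal sketch of the one-time conversion I have in mind (file names are hypothetical, and the vocabulary is built on the fly here rather than in a separate counting pass):

```python
import numpy as np

vocab = {}     # word -> integer index, built on the fly in this sketch
indexes = []   # the whole corpus as a flat list of word indexes

with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        for word in line.split():
            indexes.append(vocab.setdefault(word, len(vocab)))

# uint32 keeps the output compact and covers vocabularies up to ~4 billion entries
np.save("corpus_indexes.npy", np.asarray(indexes, dtype=np.uint32))
```

Training could then `np.load("corpus_indexes.npy", mmap_mode="r")` and iterate over integers instead of strings, though holding the full index list in memory during conversion only works if it fits in RAM.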
Hashtable lookups can be very fast, and repeated lookups may not contribute much to the overall runtime.
But the only way to really know the potential speedup of your proposed optimization is to implement it and profile it against the prior behavior.
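As a starting point, a rough micro-benchmark along these lines (synthetic data and hypothetical names; real numbers depend on your full training loop, not just this isolated comparison) would give a first indication:

```python
import timeit
import numpy as np

words = ["alpha", "beta", "gamma", "delta"] * 250_000   # ~1M synthetic tokens
vocab = {w: i for i, w in enumerate(set(words))}
pre_indexed = np.fromiter((vocab[w] for w in words), dtype=np.uint32)

# Time looking up every token each "epoch" vs. reading a pre-computed index array
lookup_each_epoch = timeit.timeit(lambda: [vocab[w] for w in words], number=10)
read_precomputed = timeit.timeit(lambda: pre_indexed.tolist(), number=10)
print(f"per-epoch hash lookups: {lookup_each_epoch:.2f}s")
print(f"pre-computed indexes:   {read_precomputed:.2f}s")
```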
Also, as you note, to be able to re-use a single-pass token-lookup, you'd need to store those results somewhere. Google's word2vec.c code, like many other implementations, is designed to work well with input corpuses that are far larger than addressable memory. Writing the interim tokenization to disk would add code complexity and require extra working space on disk, compared to the baseline of repeated lookups. So: even if it did speed things up a little, implementors might consider the extra complexity undesirable.
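For illustration, a disk-backed version of the precomputation might look roughly like this (hypothetical file names; a real word2vec pipeline would normally build the vocabulary in an earlier counting pass), which shows where the extra disk space and extra code come in:

```python
import numpy as np

vocab = {}  # word -> index; built on the fly here for brevity

# Stream word indexes to a binary file instead of holding them in memory
with open("corpus.txt", encoding="utf-8") as src, \
     open("corpus_indexes.bin", "wb") as dst:
    for line in src:
        ids = [vocab.setdefault(w, len(vocab)) for w in line.split()]
        np.asarray(ids, dtype=np.uint32).tofile(dst)

# Training can then memory-map the file and iterate indexes without
# loading the whole tokenized corpus into RAM.
indexes = np.memmap("corpus_indexes.bin", dtype=np.uint32, mode="r")
```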