neural-network, nlp, semantics, text-mining, word2vec

How does word2vec retrieve results from binary files?


from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('google_news.bin', binary=True)
print(model['the']) # this prints the 300D vector for the word 'the'

The code loads the google_news binary file into `model`. My question is: how does line 3 compute the output from a binary file (since binary files contain 0s and 1s)?


Solution

  • I'm not sure exactly what the question is here, but I assume you're asking how to load the binary file into your Python app? You can use gensim, for example, which has built-in tools to decode the binary:

    from gensim.models.keyedvectors import KeyedVectors
    model = KeyedVectors.load_word2vec_format('google_news.bin', binary=True)
    print(model['the']) # this prints the 300D vector for the word 'the'
    

    EDIT

    I feel your question is more about binary files in general? That is not really specific to word2vec. Anyway, in a word2vec binary file each line is a pair: a word and its weights, stored in binary format. First the word is decoded into a string by reading bytes one at a time until the byte for a space is met. Then the rest of the line is decoded from binary into floats. We know how many floats to read because word2vec binary files start with a header, such as "3000000 300", which tells us there are 3 million words and that each word has a 300-dimensional vector.
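The decoding steps above can be sketched by hand with Python's standard `struct` module. This is a simplified, hypothetical reader (not gensim's actual implementation), assuming the vectors are 32-bit little-endian floats as the original word2vec C tool typically writes them:

```python
import struct

def read_word2vec_binary(path):
    """Sketch of decoding a word2vec .bin file by hand."""
    with open(path, 'rb') as f:
        # Header line, e.g. b"3000000 300\n": vocab size and vector dimension.
        vocab_size, dims = map(int, f.readline().split())
        vectors = {}
        for _ in range(vocab_size):
            # Read the word byte by byte until the space separator.
            word_bytes = bytearray()
            while (ch := f.read(1)) != b' ':
                word_bytes += ch
            # Strip any stray newline left over from the previous entry.
            word = word_bytes.decode('utf-8').strip()
            # The next dims * 4 bytes are little-endian 32-bit floats.
            vec = struct.unpack(f'<{dims}f', f.read(dims * 4))
            vectors[word] = vec
        return vocab_size, dims, vectors
```

Calling `read_word2vec_binary('google_news.bin')` on a real file would return the header values plus a dict mapping each word to its vector; gensim's loader does essentially this, with more error handling and memory efficiency.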

    A binary file is organized as a series of bytes, each 8 bits. Read more about binary files on the Wikipedia page.

    The number 0.0056 in decimal format becomes, in binary:

    00111011 10110111 10000000 00110100
    

    So here four bytes make up one float. How do we know this? Because we assume the binary encodes 32-bit floats.
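You can verify that bit pattern yourself with Python's standard `struct` module, packing the value big-endian so the bytes print left to right in the same order as above:

```python
import struct

# Pack 0.0056 as a 32-bit (single-precision) IEEE 754 float.
packed = struct.pack('>f', 0.0056)

# Print each of the 4 bytes as 8 binary digits.
bits = ' '.join(f'{byte:08b}' for byte in packed)
print(bits)  # 00111011 10110111 10000000 00110100
```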

    What if the binary file stores 64-bit (double-precision) floats? Then the decimal 0.0056 becomes, in binary:

    00111111 01110110 11110000 00000110 10001101 10111000 10111010 11000111
    

    Yes, twice the length, because twice the precision. So when we decode the word2vec file, if the weights are 300-dimensional and 64-bit encoded, each number takes 8 bytes, and one word embedding needs 300 × 64 = 19,200 binary digits per line of the file. Get it?
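The same `struct` module confirms the 64-bit size and the per-line arithmetic:

```python
import struct

# 64-bit (double-precision) packing of the same value occupies 8 bytes.
packed = struct.pack('>d', 0.0056)
print(len(packed))  # 8

# A 300-dimensional vector of 64-bit floats therefore needs:
dims, bits_per_float = 300, 64
print(dims * bits_per_float)        # 19200 bits per vector
print(dims * bits_per_float // 8)   # 2400 bytes per vector
```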

    You can google "how do binary digits work"; there are millions of examples.