python gensim word2vec

invalid literal for int() with base 10: '<!DOCTYPE'


I'm trying to use a pre-trained word2vec model in Google Colab. I previously downloaded the model to my C: drive and then uploaded it to my Google Drive. However, I get an error that I can't find documented anywhere.

My code is:

import gensim
import urllib.request

urllib.request.urlretrieve("https://drive.google.com/file/d/1lgCddPxJC__QA-qGtYTdNNoHRiYWyOpQ/view?usp=sharing/GoogleNews-vectors-negative300.bin", "GoogleNews-vectors-negative300.bin")

word2vec_path = 'GoogleNews-vectors-negative300.bin'
word2vec = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

Error Message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-354-492ef9dcbbcc> in <module>()
      1 word2vec_path = 'GoogleNews-vectors-negative300.bin'
----> 2 word2vec = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

2 frames
/usr/local/lib/python3.7/dist-packages/gensim/models/utils_any2vec.py in <genexpr>(.0)
    171     with utils.smart_open(fname) as fin:
    172         header = utils.to_unicode(fin.readline(), encoding=encoding)
--> 173         vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
    174         if limit:
    175             vocab_size = min(vocab_size, limit)

ValueError: invalid literal for int() with base 10: '<!DOCTYPE'

Solution

  • As user @deceze notes, that error hints that the file contains typical HTML boilerplate (<!DOCTYPE) where the code expects two integers declaring the forthcoming count of vectors (vocab_size) and their dimensionality (vector_size).

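    It's likely your urllib.request.urlretrieve() call didn't receive the file you expected, and perhaps got a 'file not found' or other error page instead. The URL in the question is a Drive sharing link (.../view?usp=sharing), which normally serves an HTML viewer page rather than the raw file. So: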

    • Check its size & contents to see if it's what you expect.
    • Check your request code, to ensure it can even get what you need from a random cloud notebook. (Maybe the Google Drive URL requires a logged-in user, and your Colab notebook isn't able to make web requests as a logged-in version of you?)
    • If you have the valid file elsewhere, see if you can send that valid copy directly to the scratch storage space of the notebook.
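
The first check above can be sketched in a few lines of Python. This is a minimal, hedged example rather than anything from gensim itself: the helper name inspect_word2vec_download and its peek_bytes parameter are hypothetical, and the only assumption about the file format is the one visible in the traceback, namely that the first line should hold two integers (vocab_size and vector_size).

```python
from pathlib import Path


def inspect_word2vec_download(path, peek_bytes=64):
    """Report whether a downloaded file looks like word2vec data or an HTML page.

    Hypothetical helper for illustration; it is not part of gensim.
    """
    path = Path(path)
    size = path.stat().st_size

    # Read at most peek_bytes of the first line, in binary mode, so a
    # multi-gigabyte model file is never loaded into memory.
    with path.open("rb") as fin:
        first_line = fin.readline(peek_bytes)

    # An HTML error or viewer page typically starts with one of these markers.
    looks_like_html = first_line.lstrip().lower().startswith((b"<!doctype", b"<html"))

    # The loader expects a header line of the form "<vocab_size> <vector_size>".
    header = None
    parts = first_line.decode("utf-8", errors="replace").split()
    if len(parts) == 2:
        try:
            header = (int(parts[0]), int(parts[1]))
        except ValueError:
            header = None  # same failure the traceback shows, caught here

    return {"size_bytes": size, "looks_like_html": looks_like_html, "header": header}
```

Calling inspect_word2vec_download("GoogleNews-vectors-negative300.bin") on the file from the question would presumably report looks_like_html as True and a size of only a few kilobytes, whereas a genuine copy of the model is several gigabytes and has a numeric header. If the check fails, one option in Colab is to mount Google Drive with google.colab's drive.mount() and load the file from the mounted path instead of fetching it over HTTP.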