Tags: python, python-3.x, error-handling, gensim, word2vec

gensim Getting Started Error: No such file or directory: 'text8'


I am learning about the word2vec and GloVe models in Python, so I am working through the gensim getting-started material available here.

When I ran the following code step by step in IDLE 3:

>>> from gensim.models import word2vec
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> sentences = word2vec.Text8Corpus('text8')
>>> model = word2vec.Word2Vec(sentences, size=200)

I got this error:

2017-01-13 11:15:41,471 : INFO : collecting all words and their counts
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    model = word2vec.Word2Vec(sentences, size=200)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/word2vec.py", line 469, in __init__
    self.build_vocab(sentences, trim_rule=trim_rule)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/word2vec.py", line 533, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/word2vec.py", line 545, in scan_vocab
    for sentence_no, sentence in enumerate(sentences):
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/word2vec.py", line 1536, in __iter__
    with utils.smart_open(self.fname) as fin:
  File "/usr/local/lib/python3.5/dist-packages/smart_open-1.3.5-py3.5.egg/smart_open/smart_open_lib.py", line 127, in smart_open
    return file_smart_open(parsed_uri.uri_path, mode)
  File "/usr/local/lib/python3.5/dist-packages/smart_open-1.3.5-py3.5.egg/smart_open/smart_open_lib.py", line 558, in file_smart_open
    return open(fname, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'text8'

How do I fix this? Thanks in advance for your help.


Solution

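  • You're missing the corpus file the code expects. 'text8' is a relative path, so when Word2Vec starts iterating over the corpus, Python looks for a file named text8 in the current working directory and can't find it (hence the FileNotFoundError).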

    You can download the file from the URL given in the documentation for Text8Corpus:

    Docstring:      
    Iterate over sentences from the "text8" corpus, unzipped from http://mattmahoney.net/dc/text8.zip .
    

    Once you have downloaded the archive, extract it and pass the path of the extracted text8 file to Text8Corpus:

    sentences = word2vec.Text8Corpus('/path/to/text8')
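
If you would rather do the download and extraction from Python, something like the following should work. It is only a sketch: ensure_text8 and data_dir are names I made up for illustration, the URL is the one from the docstring above, and I am assuming the archive contains a single member named text8.

```python
import os
import urllib.request
import zipfile

# URL taken from the Text8Corpus docstring quoted above.
TEXT8_URL = "http://mattmahoney.net/dc/text8.zip"


def ensure_text8(data_dir=".", url=TEXT8_URL):
    """Return the absolute path of the extracted text8 file.

    Hypothetical helper: downloads and unzips the archive only when the
    extracted file is not already present in data_dir.
    """
    data_dir = os.path.abspath(data_dir)
    os.makedirs(data_dir, exist_ok=True)
    corpus_path = os.path.join(data_dir, "text8")

    if not os.path.isfile(corpus_path):
        zip_path = os.path.join(data_dir, "text8.zip")
        # Fetch the archive and save it next to where the corpus will live.
        urllib.request.urlretrieve(url, zip_path)
        # Assumption: the archive member is named "text8".
        with zipfile.ZipFile(zip_path) as archive:
            archive.extract("text8", data_dir)

    return corpus_path
```

You would then call word2vec.Text8Corpus(ensure_text8('/path/to/data')). Because the helper returns an absolute path, it no longer matters which directory IDLE was started from.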