I am trying to follow the tutorial on topic modelling / Latent Dirichlet Allocation (LDA) in the book "Building Machine Learning Systems with Python".
I have not gone too far in this book, and the first part of the topic modelling chapter returns errors for me:
from gensim import corpora, models, similarities
corpus = corpora.BleiCorpus('./data/ap/ap.dat', './data/ap/vocab.txt')
Error:
63
64 self.fname = fname
---> 65 with utils.smart_open(fname_vocab) as fin:
66 words = [utils.to_unicode(word).rstrip() for word in fin]
67 self.id2word = dict(enumerate(words))
/Users/user/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/gensim/utils.pyc in smart_open(fname, mode)
659 from gzip import GzipFile
660 return make_closing(GzipFile)(fname, mode)
--> 661 return open(fname, mode)
662
663
IOError: [Errno 2] No such file or directory: './data/ap/vocab.txt'
The vocab.txt file does not exist, but switching to the directory where it is supposed to be, I find the following:
$ ls
download_ap.sh download_wp.sh preprocess-wikidata.sh
It looks like the AP data needs to be downloaded separately (this is not mentioned in the book), so I ran:
sh download_ap.sh
I get this:
download_ap.sh: line 2: wget: command not found
tar: Error opening archive: Failed to open 'ap.tgz'
Does anybody know how to solve this issue?
Thank you
There is nothing wrong with the code or your dev environment. The error message shows the problem: wget is not installed, so the archive is never downloaded and tar has nothing to open. The same download can be done with curl, which ships with OS X, in case you want to try it. You can also download the Associated Press corpus directly from some other source (a web search will find it) and place it in the directory that gensim is using for the tutorial.
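If you would rather not install anything, the two steps the script failed on (fetching ap.tgz and unpacking it) can also be done from Python itself. The following is only a rough sketch using the standard library. The function name fetch_and_extract is made up for illustration, and the URL is deliberately left as an argument: take it from line 2 of download_ap.sh, since I don't know which address the book's script uses.

```python
import os
import tarfile

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2.7


def fetch_and_extract(url, dest_dir, archive_name="ap.tgz"):
    """Download a .tgz archive from `url` into `dest_dir` and unpack it there.

    Hypothetical helper: `url` must be supplied by the caller, for example
    the address that appears after `wget` in download_ap.sh.
    """
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    archive_path = os.path.join(dest_dir, archive_name)

    # Stands in for the `wget <url>` line of the script.
    urlretrieve(url, archive_path)

    # Stands in for the `tar` line that complained about 'ap.tgz'.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)

    # Return what ended up in dest_dir so the layout can be inspected.
    return sorted(os.listdir(dest_dir))
```

Call it with the URL from the script and the directory the script lives in, then check that the returned listing matches the paths passed to corpora.BleiCorpus (ap.dat and vocab.txt). If the archive unpacks into a subdirectory, adjust the paths accordingly.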
If you want to follow the tutorials exactly as in the book, you will need to install wget, which on OS X (I assume that's your system) requires a little bit of setup. To install wget on OS X you download the source files, compile the code, and run the install step. Compiling requires a compiler, which unfortunately doesn't come with OS X by default, so first you need to install the Xcode suite from Apple, which includes a C compiler.
This post explains how to do it step by step.
Hope this works.