I just started my first NLTK project and am confused about the proper setup. I need several resources, such as the Punkt tokenizer and the maxent POS tagger. I downloaded them myself using the GUI opened by nltk.download(). For my collaborators, I of course want these resources to be downloaded automatically. I haven't found any idiomatic code for that in the documentation.
Am I supposed to just put nltk.data.load('tokenizers/punkt/english.pickle') and the like into the code? Is this going to download the resources every time the script is run? Should I give the user (i.e. my co-developers) feedback about what is being downloaded and why it is taking so long? There MUST be something out there that does the job, right? :)
Edit: To clarify my question:
How do I test whether an nltk resource (like the Punkt Tokenizer) is already installed on the machine running my code, and install it if it is not?
You can use the nltk.data.find() function; see https://github.com/nltk/nltk/blob/develop/nltk/data.py:
>>> import nltk
>>> nltk.data.find('tokenizers/punkt.zip')
ZipFilePathPointer(u'/home/alvas/nltk_data/tokenizers/punkt.zip', u'')
When the resource is not available, the call raises a LookupError:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk-3.0a3-py2.7.egg/nltk/data.py", line 615, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource u'punkt.zip' not found. Please use the NLTK Downloader
to obtain the resource: >>> nltk.download()
Searched in:
- '/home/alvas/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**********************************************************************
Most probably, you will want to do something like this to ensure that your collaborators have the package:
>>> try:
...     nltk.data.find('tokenizers/punkt')
... except LookupError:
...     nltk.download('punkt')
...
[nltk_data] Downloading package punkt to /home/alvas/nltk_data...
[nltk_data] Package punkt is already up-to-date!
True
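If you need several resources, the same try/except pattern can be wrapped in a small helper that runs once at startup. The following is only a sketch: the name ensure_resources, the REQUIRED mapping, and the optional find/download parameters are my own inventions (not part of NLTK), and the path and package id for the maxent tagger are assumptions you should check against your NLTK version. The find/download parameters exist only so the helper can be exercised without NLTK or network access.

```python
# Sketch of a "download only if missing" helper around nltk.data.find()
# and nltk.download(). All names defined here are hypothetical.

# Maps the path passed to nltk.data.find() to the package id passed to
# nltk.download(). The tagger entry is an assumption; adjust as needed.
REQUIRED = {
    "tokenizers/punkt": "punkt",
    "taggers/maxent_treebank_pos_tagger": "maxent_treebank_pos_tagger",
}


def ensure_resources(resources, find=None, download=None):
    """Download every resource that find() cannot locate.

    Returns the list of package ids that had to be downloaded.
    """
    if find is None or download is None:
        # Import lazily so that NLTK is only needed when the real
        # functions are used.
        import nltk

        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for path, package in resources.items():
        try:
            find(path)  # raises LookupError when the resource is missing
        except LookupError:
            download(package)  # prints [nltk_data] progress lines by default
            fetched.append(package)
    return fetched
```

Calling ensure_resources(REQUIRED) at the top of your script then only triggers a download for resources that nltk.data.find() cannot locate, so nothing is fetched again on later runs, and the [nltk_data] messages tell your co-developers what is being downloaded.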