Tags: python, nlp, nltk, wordnet

Missing Spanish WordNet from NLTK


I am trying to use the Spanish WordNet from the Open Multilingual Wordnet in NLTK 3.0, but it seems that it was not downloaded with the 'omw' package. For example, with code like the following:

from nltk.corpus import wordnet as wn

print [el.lemma_names('spa') for el in wn.synsets('bank')]

I get the following error message:

IOError: No such file or directory: u'***/nltk_data/corpora/omw/spa/wn-data-spa.tab'

According to the documentation, Spanish should be included in the 'omw' package, but it was not downloaded with it. Do you know why this could happen?


Solution

  • Here's the full error traceback if a language is missing from the Open Multilingual WordNet in your nltk_data directory:

    >>> from nltk.corpus import wordnet as wn
    >>> wn.synsets('bank')[0].lemma_names('spa')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 418, in lemma_names
        self._wordnet_corpus_reader._load_lang_data(lang)
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1070, in _load_lang_data
        f = self._omw_reader.open('{0:}/wn-data-{0:}.tab'.format(lang))
      File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 198, in open
        stream = self._root.join(file).open(encoding)
      File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 309, in join
        return FileSystemPathPointer(_path)
      File "/usr/local/lib/python2.7/dist-packages/nltk/compat.py", line 380, in _decorator
        return init_func(*args, **kwargs)
      File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 287, in __init__
        raise IOError('No such file or directory: %r' % _path)
    IOError: No such file or directory: u'/home/alvas/nltk_data/corpora/omw/spa/wn-data-spa.tab'
    

    So the first thing is to check whether it's installed automatically:

    >>> import nltk
    >>> nltk.download('omw')
    [nltk_data] Downloading package omw to /home/alvas/nltk_data...
    [nltk_data]   Package omw is already up-to-date!
    True
    

    Then check the nltk_data directory, and you will find that the 'spa' folder is missing:

    alvas@ubi:~/nltk_data/corpora/omw$ ls
    als  arb  cmn  dan  eng  fas  fin  fra  fre  heb  ita  jpn  mcr  msa  nor  pol  por  README  tha
    
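
    That check can also be scripted. Here is a minimal sketch that lists which OMW languages are present under an nltk_data directory; the layout it assumes (one folder per language, each containing a wn-data-<lang>.tab file) is taken from the listing above, and the helper name is my own:

```python
import os

def installed_omw_langs(nltk_data_dir):
    """Return the language codes that have a wn-data-<lang>.tab file
    under corpora/omw in the given nltk_data directory."""
    omw_dir = os.path.join(nltk_data_dir, 'corpora', 'omw')
    langs = []
    for name in sorted(os.listdir(omw_dir)):
        # Skip plain files like README; only folders with the expected
        # tab file count as installed languages.
        tab = os.path.join(omw_dir, name, 'wn-data-%s.tab' % name)
        if os.path.isfile(tab):
            langs.append(name)
    return langs
```

    For example, `'spa' in installed_omw_langs(os.path.expanduser('~/nltk_data'))` tells you whether the fix below is needed.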

    So here's the short term solution:

    $ wget http://compling.hss.ntu.edu.sg/omw/wns/spa.zip
    $ mkdir ~/nltk_data/corpora/omw/spa
    $ unzip -p spa.zip mcr/wn-data-spa.tab > ~/nltk_data/corpora/omw/spa/wn-data-spa.tab
    

    Alternatively, you can simply copy the file from nltk_data/corpora/omw/mcr/wn-data-spa.tab.
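
    That copy can be done from Python as well; a small sketch using only the stdlib, with the paths mirroring the nltk_data layout shown above (the function name is my own):

```python
import os
import shutil

def copy_spa_from_mcr(nltk_data_dir):
    """Copy wn-data-spa.tab from the bundled 'mcr' folder into the
    'spa' folder the WordNet reader looks in, creating it if needed."""
    omw_dir = os.path.join(nltk_data_dir, 'corpora', 'omw')
    src = os.path.join(omw_dir, 'mcr', 'wn-data-spa.tab')
    dst_dir = os.path.join(omw_dir, 'spa')
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    dst = os.path.join(dst_dir, 'wn-data-spa.tab')
    shutil.copy(src, dst)
    return dst
```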

    [out]:

    >>> from nltk.corpus import wordnet as wn
    >>> wn.synsets('bank')[0].lemma_names('spa')
    [u'margen', u'orilla', u'vera']
    

    Now lemma_names() should work for Spanish. If you're looking for other languages from the Open Multilingual Wordnet, you can browse here (http://compling.hss.ntu.edu.sg/omw/), then download the data and put it in the respective nltk_data directory.
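
    For reference, the wn-data-*.tab files appear to be plain three-column TSV: a synset offset-pos, a typed key such as spa:lemma, and a value. A minimal parser sketch under that assumption (the sample lines are illustrative, not real data):

```python
from collections import defaultdict

def parse_omw_tab(lines):
    """Collect lemmas per synset from OMW-style tab lines.
    Assumes three tab-separated columns (offset-pos, type, value);
    lines starting with '#' are treated as header/comment lines."""
    lemmas = defaultdict(list)
    for line in lines:
        line = line.rstrip('\n')
        if not line or line.startswith('#'):
            continue
        offset_pos, rel, value = line.split('\t')
        # Keep only lemma rows; other rows (e.g. definitions) are skipped.
        if rel.endswith(':lemma'):
            lemmas[offset_pos].append(value)
    return dict(lemmas)
```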

    The long-term solution would be to ask the devs of the NLTK and OMW projects to update the datasets shipped for the NLTK API.