How to use Wordnet 3.1 with NLTK on Python?


Important Edit

As @Pengin pointed out in the comments, NLTK has supported WordNet 3.1 since January 2022, so this question is no longer relevant.
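
For readers on a recent NLTK release, the built-in route might look like the sketch below. It assumes the installed version exposes a `wordnet31` corpus loader and accepts `wordnet31` and `omw-1.4` as download identifiers. The helper name `wordnet31_version` and its `download` flag are illustrative and not part of NLTK.

```python
def wordnet31_version(download=False):
    """Return the version reported by NLTK's WordNet 3.1 reader, or None.

    None means NLTK is missing, too old to ship the loader, or the
    corpus data has not been downloaded.
    """
    try:
        import nltk
        from nltk.corpus import wordnet31  # assumed present in recent releases
    except ImportError:
        return None

    if download:
        # Assumed package identifiers; fetching them needs network access.
        nltk.download("wordnet31", quiet=True)
        nltk.download("omw-1.4", quiet=True)

    try:
        return wordnet31.get_version()
    except LookupError:
        # Raised when the corpus files are not installed locally.
        return None


if __name__ == "__main__":
    print(wordnet31_version())
```

Calling `wordnet31_version(download=True)` once should fetch the data if the identifiers are right; after that the default call needs no network access.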


I need to use WordNet 3.1 for my research work, but NLTK (Python) ships with WordNet 3.0 by default. It is important that I use the latest version of WordNet.

>>> from nltk.corpus import wordnet
>>> wordnet.get_version()
'3.0'

However, WordNet 3.1 is the latest version, and I cannot find any way to download and access it using nltk.download(), so I am looking for a workaround.

The current-version page on the WordNet website says the following:

WordNet 3.1 DATABASE FILES ONLY

You can download the WordNet 3.1 database files. Note that this is not a full package as those above, nor does it contain any code for running WordNet. However, you can replace the files in the database directory of your 3.0 local installation with these files and the WordNet interface will run, returning entries from the 3.1 database. This is simply a compressed tar file of the WordNet 3.1 database files.

I downloaded the WordNet 3.1 database files and used them to replace the default WordNet files at C:\Users\<username>\AppData\Roaming\nltk_data\corpora (on a Windows system). I suspected this would not work, since the instructions are to replace the database files in a WordNet software installation, but I tried anyway.

Running wordnet.get_version() now raises the following error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-d64ae1e68b36> in <module>
----> 1 wordnet.get_version()

~\anaconda3\lib\site-packages\nltk\corpus\util.py in __getattr__(self, attr)
    118             raise AttributeError("LazyCorpusLoader object has no attribute '__bases__'")
    119 
--> 120         self.__load()
    121         # This looks circular, but its not, since __load() changes our
    122         # __class__ to something new:

~\anaconda3\lib\site-packages\nltk\corpus\util.py in __load(self)
     86 
     87         # Load the corpus.
---> 88         corpus = self.__reader_cls(root, *self.__args, **self.__kwargs)
     89 
     90         # This is where the magic happens!  Transform ourselves into

~\anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py in __init__(self, root, omw_reader)
   1136 
   1137         # Load the lexnames
-> 1138         for i, line in enumerate(self.open("lexnames")):
   1139             index, lexname, _ = line.split()
   1140             assert int(index) == i

~\anaconda3\lib\site-packages\nltk\corpus\reader\api.py in open(self, file)
    206         """
    207         encoding = self.encoding(file)
--> 208         stream = self._root.join(file).open(encoding)
    209         return stream
    210 

~\anaconda3\lib\site-packages\nltk\data.py in join(self, fileid)
    335     def join(self, fileid):
    336         _path = os.path.join(self._path, fileid)
--> 337         return FileSystemPathPointer(_path)
    338 
    339     def __repr__(self):

~\anaconda3\lib\site-packages\nltk\compat.py in _decorator(*args, **kwargs)
     39     def _decorator(*args, **kwargs):
     40         args = (args[0], add_py3_data(args[1])) + args[2:]
---> 41         return init_func(*args, **kwargs)
     42 
     43     return wraps(init_func)(_decorator)

~\anaconda3\lib\site-packages\nltk\data.py in __init__(self, _path)
    313         _path = os.path.abspath(_path)
    314         if not os.path.exists(_path):
--> 315             raise IOError("No such file or directory: %r" % _path)
    316         self._path = _path
    317 

OSError: No such file or directory: 'C:\\Users\\Punit Singh\\AppData\\Roaming\\nltk_data\\corpora\\wordnet\\lexnames'

I then compared the directory structure before and after the replacement. Both trees are listed below.

File Tree In Wordnet 3.0

wordnet
├── adj.exc
├── adv.exc
├── citation.bib
├── cntlist.rev
├── data.adj
├── data.adv
├── data.noun
├── data.verb
├── index.adj
├── index.adv
├── index.noun
├── index.sense
├── index.verb
├── lexnames
├── LICENSE
├── noun.exc
├── README
└── verb.exc

File Tree In Wordnet 3.1

wordnet
├── adj.exc
├── adv.exc
├── cntlist
├── cntlist.rev
├── cousin.exc
├── data.adj
├── data.adv
├── data.noun
├── data.verb
├── index.adj
├── index.adv
├── index.noun
├── index.sense
├── index.verb
├── log.grind.3.1
├── noun.exc
├── sentidx.vrb
└── dbfiles
    ├── adj.all
    ├── adj.pert
    ├── adj.ppl
    ├── adv.all
    ├── cntlist
    ├── noun.act
    ├── noun.animal
    ├── noun.artifact
    ├── noun.attribute
    ├── noun.body
    ├── noun.cognition
    ├── noun.communication
    ├── noun.event
    ├── noun.feeling
    ├── noun.food
    ├── noun.group
    ├── noun.location
    ├── noun.motive
    ├── noun.object
    ├── noun.person
    ├── noun.phenomenon
    ├── noun.plant
    ├── noun.possession
    ├── noun.process
    ├── noun.quantity
    ├── noun.relation
    ├── noun.shape
    ├── noun.state
    ├── noun.substance
    ├── noun.time
    ├── noun.Tops
    ├── verb.body
    ├── verb.change
    ├── verb.cognition
    ├── verb.communication
    ├── verb.competition
    ├── verb.consumption
    ├── verb.contact
    ├── verb.creation
    ├── verb.emotion
    ├── verb.Framestext
    ├── verb.motion
    ├── verb.perception
    ├── verb.possession
    ├── verb.social
    ├── verb.stative
    └── verb.weather
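
The two listings can be compared mechanically to see which files the reader will fail to find. A minimal sketch using only the top-level names shown in the two trees (the variable and function names are illustrative):

```python
# Top-level entries copied from the two listings in the question.
WN30 = """adj.exc adv.exc citation.bib cntlist.rev data.adj data.adv
data.noun data.verb index.adj index.adv index.noun index.sense index.verb
lexnames LICENSE noun.exc README verb.exc""".split()

WN31 = """adj.exc adv.exc cntlist cntlist.rev cousin.exc data.adj data.adv
data.noun data.verb index.adj index.adv index.noun index.sense index.verb
log.grind.3.1 noun.exc sentidx.vrb dbfiles""".split()


def compare_listings(old, new):
    """Return (missing, added): names only in `old`, names only in `new`."""
    old_set, new_set = set(old), set(new)
    return sorted(old_set - new_set), sorted(new_set - old_set)


missing, added = compare_listings(WN30, WN31)
print("only in 3.0:", missing)
print("only in 3.1:", added)
```

Among the entries that appear only in the 3.0 listing is lexnames, the file named in the OSError above.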

Any suggestions or solutions on how to use WordNet 3.1 with NLTK (Python) would be helpful.

Thanks in advance.


Solution

  • After a lot of searching and trial and error, I was able to use WordNet 3.1 with NLTK (Python). I adapted this gist to make it work, and the details are below.

    I split the code from the gist into three parts.

    Part 1. download_extract.py

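    import os
    import shutil
    import tarfile
    import urllib.request
    
    nltkdata_wn = '/path/to/nltk_data/corpora/wordnet/'
    backup_dir = nltkdata_wn.rstrip('/') + '_3.0'
    wn31 = "http://wordnetcode.princeton.edu/wn3.1.dict.tar.gz"
    archive = 'wn3.1.dict.tar.gz'
    
    # Back up the existing WordNet 3.0 folder: wordnet -> wordnet_3.0.
    if not os.path.exists(backup_dir):
        shutil.move(nltkdata_wn.rstrip('/'), backup_dir)
    os.makedirs(nltkdata_wn, exist_ok=True)
    
    # Download the WordNet 3.1 database archive if it is not already here.
    if not os.path.exists(archive):
        urllib.request.urlretrieve(wn31, archive)
    
    # Extract the archive, which unpacks into a 'dict' subfolder.
    with tarfile.open(archive, 'r:gz') as tar:
        tar.extractall(nltkdata_wn)
    
    # Move the contents of 'dict' up into the wordnet folder.
    dict_dir = os.path.join(nltkdata_wn, 'dict')
    for name in os.listdir(dict_dir):
        shutil.move(os.path.join(dict_dir, name), nltkdata_wn)
    os.rmdir(dict_dir)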
    

    This backs up the existing WordNet 3.0 folder from wordnet to wordnet_3.0, downloads the WordNet 3.1 database, and puts it in the wordnet folder. Since I am on a Windows system, I did this manually.

    Part 2. create_lexnames.py

    import os
    
    nltkdata_wn = '/path/to/nltk_data/corpora/wordnet/'
    dbfiles = nltkdata_wn + 'dbfiles'
    
    # Syntactic category number for each part-of-speech prefix.
    syncats = {"noun": 1, "verb": 2, "adj": 3, "adv": 4}
    
    # Write one line per file in dbfiles: index, file name, category.
    with open(nltkdata_wn + 'lexnames', 'w') as fout:
        for i, j in enumerate(sorted(os.listdir(dbfiles))):
            pos = j.partition('.')[0]
            # 'cntlist' has no POS prefix, so its own name is the placeholder.
            syncat = syncats.get(pos, j)
            fout.write("\t".join([str(i).zfill(2), j, str(syncat)]) + "\n")
    

    This creates the required lexnames file in the wordnet folder.
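
One caveat worth checking: WordNet's lexnames(5WN) documentation numbers the lexicographer files in a fixed order (03 is noun.Tops, 05 is noun.animal, 44 is adj.ppl), and the data files refer to them by that number. A sorted directory listing yields a different numbering, so Synset.lexname() may return the wrong name. The sketch below writes the documented order instead. The function names are illustrative, and it assumes the numbering is the same in 3.1 as in 3.0.

```python
import os
import tempfile

# Lexicographer files in documented order: 00-02, nouns 03-28,
# verbs 29-43, and adj.ppl as 44.
NOUNS = """Tops act animal artifact attribute body cognition communication
event feeling food group location motive object person phenomenon plant
possession process quantity relation shape state substance time""".split()

VERBS = """body change cognition communication competition consumption
contact creation emotion motion perception possession social stative
weather""".split()

# Syntactic category numbers, matching the ones used in Part 2.
SYNCAT = {"noun": 1, "verb": 2, "adj": 3, "adv": 4}


def lexname_entries():
    """Return (index, name, category) tuples in the documented order."""
    names = ["adj.all", "adj.pert", "adv.all"]
    names += ["noun." + n for n in NOUNS]
    names += ["verb." + v for v in VERBS]
    names.append("adj.ppl")
    return [(i, name, SYNCAT[name.split(".")[0]])
            for i, name in enumerate(names)]


def write_lexnames(path):
    """Write a tab-separated lexnames file to `path`."""
    with open(path, "w") as fout:
        for i, name, syncat in lexname_entries():
            fout.write("%02d\t%s\t%d\n" % (i, name, syncat))


if __name__ == "__main__":
    # Demonstration in a temporary folder rather than the real nltk_data.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "lexnames")
        write_lexnames(target)
        with open(target) as fin:
            print(fin.readline().split())
```

If the numbering is indeed unchanged, copying lexnames from the backed-up wordnet_3.0 folder would be a simpler alternative.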

    Part 3. testing_wn31.py

    from nltk.corpus import wordnet as wn
    
    nltkdata_wn = '/path/to/nltk_data/corpora/wordnet/'
    
    # Check the generated lexnames file: each line must have three fields
    # and its index must match its position in the file.
    with open(nltkdata_wn + 'lexnames', 'r') as fin:
        for i, line in enumerate(fin):
            index, lexname, _ = line.split()
            assert int(index) == i
    
    # Test the WordNet functions.
    print(wn.synsets('dog'))
    for i in wn.all_synsets():
        print(i, i.pos(), i.definition())
    

    This checks the generated lexnames file and confirms that the WordNet functions work.

    After completing this procedure, I ran the following code in Python and confirmed that it is actually running version 3.1:

    >>> from nltk.corpus import wordnet
    >>> wordnet.get_version()
    '3.1'
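
One plausible reason the reported version changes is that it is read from the data files themselves rather than stored in NLTK. The sketch below shows that kind of check. It assumes the license header of a file such as data.adj contains a line of the form "WordNet 3.1 Copyright ..."; the regular expression and helper name are illustrative, not NLTK's own.

```python
import re

# Pattern assumed to match the license header line in the data files.
VERSION_RE = re.compile(r"Word[nN]et (\d+(?:\.\d+)?) Copyright")


def version_from_header(lines):
    """Return the first WordNet version found in `lines`, or None."""
    for line in lines:
        match = VERSION_RE.search(line)
        if match:
            return match.group(1)
    return None


# Hypothetical header lines, for illustration only.
sample = [
    "  1 This software and database is being provided to you, the LICENSEE,",
    "  2 WordNet 3.1 Copyright 2011 by Princeton University.",
]
print(version_from_header(sample))
```

To try it on a real installation, pass an open file handle for data.adj in the wordnet folder in place of `sample`.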
    

    A Word of Caution

    Once you have replaced the WordNet 3.0 database with 3.1, you'll notice that if you run the following code

    >>> import nltk
    >>> nltk.download()
    

    in the download dialog box, the Corpora tab will show WordNet as out of date. Do not update it: doing so will either revert WordNet to version 3.0 or break the installation.
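
As a safeguard against an accidental update, the patched folder can be copied somewhere outside the downloader's reach. A minimal sketch using only the standard library; the helper name `backup_corpus` and the timestamped naming scheme are illustrative choices.

```python
import os
import shutil
import tempfile
import time


def backup_corpus(corpus_dir, backup_root):
    """Copy `corpus_dir` into `backup_root` under a timestamped name."""
    name = os.path.basename(os.path.normpath(corpus_dir))
    stamp = time.strftime("%Y%m%d_%H%M%S")
    target = os.path.join(backup_root, "%s_backup_%s" % (name, stamp))
    shutil.copytree(corpus_dir, target)
    return target


if __name__ == "__main__":
    # Demonstration on a throwaway directory rather than real nltk_data.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "wordnet")
        os.mkdir(src)
        with open(os.path.join(src, "lexnames"), "w") as fh:
            fh.write("00\tadj.all\t3\n")
        copy = backup_corpus(src, tmp)
        print(os.listdir(copy))
```

For a real installation, `corpus_dir` would be the patched wordnet folder and `backup_root` any directory outside nltk_data.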