
Using NLTK's universal tagset with non-English corpora


I'm using NLTK (3.0.4-1) in Python 3.4.3+ and I'd like to process some of the tagged corpora using the universal tagset (which I had to install), as explained in the NLTK book, chapter 5.

I can access any of these corpora with their original PoS tagset, e.g.:

from nltk.corpus import brown, cess_esp, floresta

print(brown.tagged_sents()[0])
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')]

print(cess_esp.tagged_sents()[0])
[('El', 'da0ms0'), ('grupo', 'ncms000'), ('estatal', 'aq0cs0'), ('Electricité_de_France', 'np00000'), ('-Fpa-', 'Fpa'), ('EDF', 'np00000'), ('-Fpt-', 'Fpt'), ('anunció', 'vmis3s0'), ('hoy', 'rg'), (',', 'Fc'), ('jueves', 'W'), (',', 'Fc'), ('la', 'da0fs0'), ('compra', 'ncfs000'), ('del', 'spcms'), ('51_por_ciento', 'Zp'), ('de', 'sps00'), ('la', 'da0fs0'), ('empresa', 'ncfs000'), ('mexicana', 'aq0fs0'), ('Electricidad_Águila_de_Altamira', 'np00000'), ('-Fpa-', 'Fpa'), ('EAA', 'np00000'), ('-Fpt-', 'Fpt'), (',', 'Fc'), ('creada', 'aq0fsp'), ('por', 'sps00'), ('el', 'da0ms0'), ('japonés', 'aq0ms0'), ('Mitsubishi_Corporation', 'np00000'), ('para', 'sps00'), ('poner_en_marcha', 'vmn0000'), ('una', 'di0fs0'), ('central', 'ncfs000'), ('de', 'sps00'), ('gas', 'ncms000'), ('de', 'sps00'), ('495', 'Z'), ('megavatios', 'ncmp000'), ('.', 'Fp')]

print(floresta.tagged_sents()[0])
[('Um', '>N+art'), ('revivalismo', 'H+n'), ('refrescante', 'N<+adj')]

So far so good, but when I use the option tagset='universal' to access the simplified version of the PoS tags, it works only for the Brown corpus.

print(brown.tagged_sents(tagset='universal')[0])
[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'ADP'), ("Atlanta's", 'NOUN'), ('recent', 'ADJ'), ('primary', 'NOUN'), ('election', 'NOUN'), ('produced', 'VERB'), ('``', '.'), ('no', 'DET'), ('evidence', 'NOUN'), ("''", '.'), ('that', 'ADP'), ('any', 'DET'), ('irregularities', 'NOUN'), ('took', 'VERB'), ('place', 'NOUN'), ('.', '.')]

When accessing the corpora in Spanish and Portuguese I get a long chain of errors and a LookupError exception.

print(cess_esp.tagged_sents(tagset='universal')[0])
---------------------------------------------------------------------------
LookupError                               Traceback (most recent call last)
<ipython-input-6-4e2e43e54e2d> in <module>()
----> 1 print(cess_esp.tagged_sents(tagset='universal')[0])

[...]

LookupError: 
**********************************************************************
  Resource 'taggers/universal_tagset/unknown.map' not found.
  Please use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/victor/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************

Among the mappings located in my taggers/universal_tagset directory, I can find the mappings for Spanish (es-cast3lb.map) and Portuguese (pt-bosque.map), but I don't have any unknown.map file. Any ideas how to solve it?

Thanks in advance :-)


Solution

  • That's an interesting question. NLTK implements mapping to the Universal tagset only for a fixed collection of corpora, with the help of the fixed maps you found in nltk_data/taggers/universal_tagset/. Except for a few special cases (which include treating the brown corpus as if it were named en-brown), the rule is to look for a mapping file with the same name as the tagset used by your corpus. In your case, the tagset is set to "unknown", which is why you see that message.

    Now, are you sure "the mapping for Spanish", i.e. the map es-cast3lb.map, actually matches the tagset of your corpus? I certainly wouldn't just assume it does, since any project can define its own tagset and annotation conventions. If this is indeed the tagset your corpus uses, your problem has an easy solution:

    • When you initialize your corpus reader, e.g. cess_esp, add the option tagset="es-cast3lb" to the constructor. If necessary, e.g. for corpora already loaded by the NLTK with tagset="unknown", you can override the tagset after initialization like this:

      cess_esp._tagset = "es-cast3lb"
      

    This tells the corpus reader what tagset is used in the corpus. After that, specifying tagset="universal" should cause the selected mapping to be applied.

    If this tagset is not actually suited to your corpus, your first job is to study the documentation of your corpus's tagset and create an appropriate mapping to the Universal tagset; as you've probably seen, the format is pretty trivial. You can then put your mapping into operation by dropping it in nltk_data/taggers/universal_tagset. Adding your own resources to the nltk_data area is decidedly a hack, but if you get this far, I recommend you contribute your tagset map to the NLTK, which will resolve the hack after the fact.
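    To illustrate the "pretty trivial" format: as far as I can tell, each line of a universal_tagset .map file is a corpus-specific tag and a universal tag separated by a tab. The tags below are made up for illustration, not copied from a real map file:

```python
# Illustrative sketch of the .map file format: one "<corpus tag>\t<universal tag>"
# pair per line. These tags are hypothetical, not from an actual es-cast3lb.map.
sample = "aq\tADJ\nnc\tNOUN\nvm\tVERB"

# Parsing it the way NLTK (roughly) does: one (source, target) pair per line.
mapping = dict(line.split("\t") for line in sample.splitlines())
print(mapping["nc"])  # NOUN
```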

    Edit: So (per the comments) it's the right tagset, but only the 1-2 letter POS tags are in the mapping dictionary (the rest of the tag presumably describes the features of inflected words). Here's a quick way to extend the mapping dictionary on the fly, so that you can see the universal tags:

    import nltk
    from nltk.corpus import cess_esp
    cess_esp._tagset = "es-cast3lb"
    
    nltk.tag.mapping._load_universal_map("es-cast3lb")  # initialize; normally loaded on demand
    mapdict = nltk.tag.mapping._MAPPINGS["es-cast3lb"]["universal"] # shortcut to the map
    
    alltags = set(t for w, t in cess_esp.tagged_words())
    for tag in alltags:
        if len(tag) <= 2:   # These are complete
            continue
        mapdict[tag] = mapdict[tag[:2]]
    

    This discards the agreement information. If you'd rather decorate the "universal" tags with it, just set mapdict[tag] to mapdict[tag[:2]]+"-"+tag[2:].

    I'd save this dictionary to a file as described above, so that you don't have to recompute the mapping every time you load your corpus.
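    A minimal sketch of that last step: writing the extended dictionary back out in the same tab-separated format, so it can be dropped into nltk_data/taggers/universal_tagset/. The file name and the dictionary contents here are hypothetical placeholders:

```python
# Hypothetical extended mapping; in practice this would be the mapdict
# built above from the corpus tags.
mapdict = {"aq": "ADJ", "aq0fs0": "ADJ", "nc": "NOUN", "ncms000": "NOUN"}

# Dump one "<corpus tag>\t<universal tag>" pair per line, sorted for stability.
with open("es-cast3lb-extended.map", "w", encoding="utf-8") as f:
    for source, target in sorted(mapdict.items()):
        f.write(f"{source}\t{target}\n")
```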