OpenNLP POSTaggerME and ChunkerME synergy


I'm trying to use the OpenNLP chunking API to chunk a Portuguese sentence. First I tokenized the sentence with TokenizerME, then I tagged it with POSTaggerME. For both steps I used the ready-made models provided by the project.

For the sentence “Ivo viu a uva”, POSTaggerME returns the tags [PROPN, VERB, DET, NOUN]. The model seems to use the UD POS tags.

As there is no ready-made ChunkerME model for Portuguese, I followed the instructions and trained one myself: first I used the ChunkerConverter tool to convert the corpus from the "arvore deitada" format to CoNLL-2000, then I generated the model with the ChunkerTrainerME tool. Everything worked well. For the sentence above, the chunker produced the correct tags ([B-NP, B-VP, B-NP, I-NP]).

For more complex sentences, however, the results are not as good.

I was trying to identify what I could improve in the chunker training, and one of the things I noticed is a mismatch between the tag sets. The Portuguese corpus (Bosque 8.0) seems to use its own Portuguese tags: for example, prop instead of PROPN and art instead of DET.

It seems to me that this could lead to problems, especially since one of the parameters the chunker receives at runtime is an array of UD tags, while it was trained with a different tag set.

Before writing a routine to convert the Portuguese notation to UD (or Penn), I wanted to ask whether

  1. this does indeed have an impact,
  2. there is a tool that already does this translation and
  3. there are any other suggestions for improving the chunker precision/recall.

Solution

    Q1

    Yes, the chosen tag set (UD, Penn, custom) has an impact. Conversion between tag sets does not work equally well in both directions:

    • Penn -> UD should work well.
    • UD -> Penn is not a good idea, as it is a lossy conversion: the UD tag set is less detailed than the "classic" Penn tag set.
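To illustrate why the direction matters, here is a minimal sketch using a small, simplified subset of a Penn -> UD mapping. The class name `PennToUdDemo` and the helper `pennCandidates` are hypothetical and only serve the example; this is not an OpenNLP API, and the table is deliberately incomplete. Several Penn tags collapse into one UD tag, so the reverse lookup is ambiguous.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class; not part of OpenNLP.
final class PennToUdDemo {

    // Simplified, illustrative subset of a Penn -> UD mapping (not a complete table).
    static final Map<String, String> PENN_TO_UD = new LinkedHashMap<>();
    static {
        PENN_TO_UD.put("NN", "NOUN");    // singular noun
        PENN_TO_UD.put("NNS", "NOUN");   // plural noun
        PENN_TO_UD.put("NNP", "PROPN");  // singular proper noun
        PENN_TO_UD.put("NNPS", "PROPN"); // plural proper noun
        PENN_TO_UD.put("JJ", "ADJ");     // adjective
        PENN_TO_UD.put("JJR", "ADJ");    // comparative adjective
        PENN_TO_UD.put("JJS", "ADJ");    // superlative adjective
    }

    // Returns every Penn tag in the table that maps to the given UD tag.
    // More than one candidate means the UD tag alone cannot be converted back.
    static Set<String> pennCandidates(String udTag) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, String> e : PENN_TO_UD.entrySet()) {
            if (e.getValue().equals(udTag)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(PENN_TO_UD.get("NNS"));  // prints NOUN
        System.out.println(pennCandidates("NOUN")); // prints [NN, NNS]
    }
}
```

Penn -> UD is a plain lookup, while UD -> Penn would need extra information (here, number or degree) that the UD tag itself does not carry.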

    Using a custom, language-specific tag set can work, but it is a matter of mapping from/to UD correctly. This might work for some tag sets and languages; for others it might be too complicated or too lossy.
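Such a mapping could be applied to the converted training data before the chunker is trained, so that training and runtime use the same tag set. The following is a rough sketch under several assumptions: the class `BosqueToUdMapper` and its methods are hypothetical (not an OpenNLP API), the training file is assumed to hold one `word tag chunk` triple per line as in CoNLL-2000, and only `prop -> PROPN` and `art -> DET` come from the question. The remaining table entries are assumed, illustrative mappings that would have to be checked against the Bosque tag documentation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; not part of OpenNLP.
final class BosqueToUdMapper {

    static final Map<String, String> BOSQUE_TO_UD = new HashMap<>();
    static {
        // Taken from the question.
        BOSQUE_TO_UD.put("prop", "PROPN");
        BOSQUE_TO_UD.put("art", "DET");
        // Assumed entries for illustration; verify against the corpus documentation.
        BOSQUE_TO_UD.put("n", "NOUN");
        BOSQUE_TO_UD.put("v-fin", "VERB");
        BOSQUE_TO_UD.put("adj", "ADJ");
        BOSQUE_TO_UD.put("adv", "ADV");
        BOSQUE_TO_UD.put("prp", "ADP");
    }

    // Maps a corpus tag to UD; unmapped tags fall back to "X" (UD's "other" tag)
    // so that gaps in the table are easy to spot in the converted data.
    static String toUd(String corpusTag) {
        return BOSQUE_TO_UD.getOrDefault(corpusTag, "X");
    }

    // Rewrites the POS column of one "word tag chunk" line.
    // Blank lines (sentence boundaries) and malformed lines are returned unchanged.
    static String convertLine(String line) {
        if (line.trim().isEmpty()) {
            return line;
        }
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3) {
            return line;
        }
        return parts[0] + " " + toUd(parts[1]) + " " + parts[2];
    }

    public static void main(String[] args) {
        String[] sentence = {"Ivo prop B-NP", "viu v-fin B-VP", "a art B-NP", "uva n I-NP"};
        for (String line : sentence) {
            System.out.println(convertLine(line));
        }
        // prints:
        // Ivo PROPN B-NP
        // viu VERB B-VP
        // a DET B-NP
        // uva NOUN I-NP
    }
}
```

The rewritten file could then be passed to the ChunkerTrainerME tool as before. Counting how many tokens end up as `X` gives a quick indication of how complete the table is.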

    Q2

    No, there isn't. The OpenNLP project accepts code donations for upcoming releases, if you want to contribute such a mapping/translation for Portuguese.

    Q3

    This needs more details and discussion on the Apache OpenNLP user and/or dev mailing lists. Alternatively, feel free to open a Jira issue if you can narrow the topic down to a clear idea or a proposed code addition.