Issues Regarding Training a MaltParser Model


I am trying to train a MaltParser model for Bangla. I have annotated a small corpus in CoNLL-U format, but training on it fails with a null pointer error. I then tried some treebanks collected from the UD website, and training works on those datasets. My questions are:

  1. Can I train a MaltParser model without XPOSTAG? I have annotated the UPOSTAG field, and the XPOSTAG field is just a copy of UPOSTAG. Do I need to annotate XPOSTAG separately? This is the only difference between my treebank and the UD treebank.

  2. Since this is for evaluation purposes, can I automatically convert UPOSTAG to XPOSTAG?

ref: http://universaldependencies.org/format.html

For better understanding, I am giving examples from both my treebank and a UD treebank.

My example treebank (it contains mistakes and some empty fields; the language is Bangla)

1   Ajake   _   NOUN    NOUN    _   5   iobj    _   _
2   rAtera  _   NOUN    NOUN    _   1   nmod    _   _
3   AbahAoYA    _   NOUN    NOUN    _   5   nsubj   _   _
4   kemana  _   ADV ADV _   5   advmod  _   _
5   hate    _   VERB    VERB    _   0   root    _   _
6   pAre    _   AUX AUX _   5   aux _   SpaceAfter=No
7   ?   _   _   _   _   _   _   _   _

1   Ajake   _   NOUN    NOUN    _   5   iobj    _   _
2   bikAlera    _   NOUN    NOUN    _   1   nmod    _   _
3   paribesha   _   NOUN    NOUN    _   5   nsubj   _   _
4   kemana  _   ADV ADV _   5   advmod  _   _
5   hate    _   VERB    VERB    _   0   root    _   _
6   pAre    _   AUX AUX _   5   aux _   SpaceAfter=No
7   ?   _   _   _   _   _   _   _   _

UD treebank

1   From    _   ADP IN  _   3   case    _   _
2   the _   DET DT  _   3   det _   _
3   AP  _   PROPN   NNP _   4   nmod    _   _
4   comes   _   VERB    VBZ _   0   root    _   _
5   this    _   DET DT  _   6   det _   _
6   story   _   NOUN    NN  _   4   nsubj   _   _
7   :   _   PUNCT   :   _   4   punct   _   _

1   President   _   PROPN   NNP _   2   compound    _   _
2   Bush    _   PROPN   NNP _   5   nsubj   _   _
3   on  _   ADP IN  _   4   case    _   _
4   Tuesday _   PROPN   NNP _   5   nmod    _   _
5   nominated   _   VERB    VBD _   0   root    _   _
6   two _   NUM CD  _   7   nummod  _   _
7   individuals _   NOUN    NNS _   5   dobj    _   _
8   to  _   PART    TO  _   9   mark    _   _
9   replace _   VERB    VB  _   5   advcl   _   _
10  retiring    _   VERB    VBG _   11  amod    _   _
11  jurists _   NOUN    NNS _   9   dobj    _   _
12  on  _   ADP IN  _   14  case    _   _
13  federal _   ADJ JJ  _   14  amod    _   _
14  courts  _   NOUN    NNS _   11  nmod    _   _
15  in  _   ADP IN  _   18  case    _   _
16  the _   DET DT  _   18  det _   _
17  Washington  _   PROPN   NNP _   18  compound    _   _
18  area    _   NOUN    NN  _   14  nmod    _   _
19  .   _   PUNCT   .   _   5   punct   _   _

Solution

  • OK, I found the solution to the first problem. You don't need a separate XPOSTAG; duplicating UPOSTAG allows training. My problem was that no word or punctuation mark (the "?" in my example) can be left blank. It has to be POS-tagged and made dependent on the root. That solved my issue.

    For the second question, the answer is less clear-cut. There is no valid one-to-one mapping between UPOSTAG and XPOSTAG, because XPOSTAG is language-dependent. Any table using the Penn Treebank tags will work, but the output will need post-processing for accuracy.
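
To catch tokens like the blank "?" before training, a small check along these lines may help. This is only a sketch and not part of MaltParser: the function names `find_incomplete_tokens` and `copy_upos_to_xpos` are made up for illustration, and it assumes the file uses the standard ten tab-separated CoNLL-U columns. It reports tokens whose UPOSTAG, HEAD, or DEPREL is still "_", and it can copy UPOSTAG into an empty XPOSTAG column.

```python
# Sketch: flag CoNLL-U tokens whose UPOSTAG, HEAD, or DEPREL is still "_",
# and copy UPOSTAG into an empty XPOSTAG column.
# Assumed column order (CoNLL-U):
# ID FORM LEMMA UPOSTAG XPOSTAG FEATS HEAD DEPREL DEPS MISC

UPOS, XPOS, HEAD, DEPREL = 3, 4, 6, 7
REQUIRED = {"UPOSTAG": UPOS, "HEAD": HEAD, "DEPREL": DEPREL}


def find_incomplete_tokens(lines):
    """Return (line_number, form, problems) for each incomplete token line."""
    found = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank line = sentence boundary, "#" = comment
        cols = line.split("\t")
        if len(cols) != 10:
            found.append((number, line, ["expected 10 tab-separated columns"]))
            continue
        if "-" in cols[0] or "." in cols[0]:
            continue  # multiword-token ranges and empty nodes may use "_"
        missing = [name for name, idx in REQUIRED.items() if cols[idx] == "_"]
        if missing:
            found.append((number, cols[1], missing))
    return found


def copy_upos_to_xpos(line):
    """Fill an empty XPOSTAG with the UPOSTAG value; return other lines unchanged."""
    stripped = line.rstrip("\n")
    cols = stripped.split("\t")
    if len(cols) == 10 and cols[XPOS] == "_" and cols[UPOS] != "_":
        cols[XPOS] = cols[UPOS]
        return "\t".join(cols)
    return stripped


# Last three tokens of the Bangla example from the question.
sample = [
    "5\thate\t_\tVERB\tVERB\t_\t0\troot\t_\t_",
    "6\tpAre\t_\tAUX\tAUX\t_\t5\taux\t_\tSpaceAfter=No",
    "7\t?\t_\t_\t_\t_\t_\t_\t_\t_",
]
for number, form, missing in find_incomplete_tokens(sample):
    print(f"line {number}: token {form!r} is missing {', '.join(missing)}")
```

On the sample, only the "?" line is reported. In my case the fix was to fill it in the same way the UD example treats punctuation, i.e. something like `7  ?  _  PUNCT  PUNCT  _  5  punct  _  _`, with the head pointing at the root token.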