I'm trying to use MaltParser with the pre-made English model. However, I do not know how to convert a text corpus of English sentences into the CoNLL format that MaltParser requires as input. I could not find any documentation on the site. How should I go about this?
Update: I am referring to the post "Create .conll file as output of Stanford Parser" to create a .conll file. However, that approach uses Stanford Parser.
There is a CoNLL formatting option for CoreNLP output, but unfortunately it doesn't match what MaltParser expects. (Confusingly, there are several different common CoNLL data formats, one for each of the different competition years.)
If you run CoreNLP from the command line with the option -outputFormat conll, you'll get output in the following TSV format (example output at end of answer):
INDEX WORD LEMMA POS NER DEPHEAD DEPREL
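To make the column layout concrete, here is a minimal Python sketch that reads this seven-column output into per-sentence token records. It assumes the fields are tab-separated and that sentences are separated by blank lines; the function name `read_corenlp_conll` and the lowercase dictionary keys are made up for illustration and are not part of CoreNLP.

```python
from typing import Dict, Iterable, List

# Keys mirror the column order listed above (naming is this sketch's own choice).
COLUMNS = ["index", "word", "lemma", "pos", "ner", "dephead", "deprel"]


def read_corenlp_conll(lines: Iterable[str]) -> List[List[Dict[str, str]]]:
    """Group tab-separated token lines into sentences (lists of token dicts)."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Assumption: a blank line marks a sentence boundary.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}: {line!r}")
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        # Flush the last sentence when the input does not end with a blank line.
        sentences.append(current)
    return sentences
```

Passing an open file handle works as well as a list of strings, since the function only iterates over lines.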
MaltParser expects a slightly different format, but you can customize the data input/output format. Try putting this content in maltparser/appdata/dataformat/myconll.xml:
<?xml version="1.0" encoding="UTF-8"?>
<dataformat name="myconll" reader="tab" writer="tab">
<column name="ID" category="INPUT" type="INTEGER"/>
<column name="FORM" category="INPUT" type="STRING"/>
<column name="LEMMA" category="INPUT" type="STRING"/>
<column name="POSTAG" category="INPUT" type="STRING"/>
<column name="NER" category="IGNORE" type="STRING"/>
<column name="HEAD" category="HEAD" type="INTEGER"/>
<column name="DEPREL" category="DEPENDENCY_EDGE_LABEL" type="STRING"/>
</dataformat>
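Before pointing MaltParser at the file, it may be worth checking that it is well-formed and declares its columns in the same order as the CoreNLP output. The following is a rough sketch using only Python's standard library; `dataformat_columns` is a hypothetical helper, not something MaltParser provides.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def dataformat_columns(xml_text: str) -> List[Tuple[str, str]]:
    """Return (name, category) for each <column> in a dataformat definition."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    if root.tag != "dataformat":
        raise ValueError(f"unexpected root element: {root.tag}")
    return [(col.get("name"), col.get("category")) for col in root.findall("column")]


if __name__ == "__main__":
    # Path taken from the answer; adjust it to your MaltParser installation.
    with open("maltparser/appdata/dataformat/myconll.xml", encoding="utf-8") as handle:
        for name, category in dataformat_columns(handle.read()):
            print(name, category)
```

If the file matches the definition above, the helper returns seven entries, with NER in fifth position marked IGNORE.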
Then add the following to your MaltParser config file (you can find an example config in maltparser/examples/optionexample.xml):
<?xml version="1.0" encoding="UTF-8"?>
<experiment>
<optioncontainer>
...
<optiongroup groupname="input">
<option name="format" value="myconll"/>
</optiongroup>
</optioncontainer>
...
</experiment>
Then you should be able to provide CoreNLP CoNLL output as training data to MaltParser.
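An alternative to defining a custom data format is to rewrite the CoreNLP output into the ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The sketch below is one way to do that in Python, under a few assumptions: the input is tab-separated, the POS tag is copied into both CPOSTAG and POSTAG, the NER column is dropped, and FEATS, PHEAD and PDEPREL are filled with "_". The function names are illustrative, and I have not verified this against a particular pre-trained model.

```python
import sys
from typing import Iterable, Iterator


def corenlp_to_conllx(line: str) -> str:
    """Convert one 7-column CoreNLP token line to a 10-column CoNLL-X line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}: {line!r}")
    index, word, lemma, pos, _ner, head, deprel = fields
    # Assumption: reuse the fine-grained POS tag as the coarse tag (CPOSTAG).
    return "\t".join([index, word, lemma, pos, pos, "_", head, deprel, "_", "_"])


def convert(lines: Iterable[str]) -> Iterator[str]:
    """Convert a stream of lines, keeping blank lines as sentence separators."""
    for line in lines:
        yield corenlp_to_conllx(line) if line.strip() else ""


if __name__ == "__main__":
    # Reads CoreNLP output on stdin and writes CoNLL-X lines to stdout.
    for converted in convert(sys.stdin):
        print(converted)
```

Saved under a name of your choosing, the script can sit at the end of a shell pipeline that runs CoreNLP with -outputFormat conll.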
Untested, but if the MaltParser docs are honest, this should work.
Example CoreNLP CoNLL output (I only used the annotators tokenize,ssplit,pos):
$ echo "This is a test." | java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -outputFormat conll 2>/dev/null
1 This this DT _ _ _
2 is be VBZ _ _ _
3 a a DT _ _ _
4 test test NN _ _ _
5 . . . _ _ _