
How to convert text file to CoNLL format for malt parser?


I'm trying to use MaltParser with the pre-made English model. However, I do not know how to convert a text corpus of English sentences into the CoNLL format that MaltParser requires. I could not find any documentation on the site. How should I go about this?

Update: I am following the post "Create .conll file as output of Stanford Parser" to create a .conll file. However, that approach uses the Stanford Parser.


Solution

  • There is a CoNLL formatting option for CoreNLP output, but unfortunately it doesn't match what MaltParser expects. (Confusingly, there are several different common CoNLL data formats, one for each competition year.)

    If you run CoreNLP from the command line with the option -outputFormat conll, you'll get output in the following TSV format (example output at end of answer):

    INDEX    WORD    LEMMA    POS    NER    DEPHEAD    DEPREL
    

    MaltParser expects a slightly different format, but you can customize its data input/output format. Try putting this content in maltparser/appdata/dataformat/myconll.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <dataformat name="myconll" reader="tab" writer="tab">
        <column name="ID" category="INPUT" type="INTEGER"/>
        <column name="FORM" category="INPUT" type="STRING"/>
        <column name="LEMMA" category="INPUT" type="STRING"/>
        <column name="POSTAG" category="INPUT" type="STRING"/>
        <column name="NER" category="IGNORE" type="STRING"/>
        <column name="HEAD" category="HEAD" type="INTEGER"/>
        <column name="DEPREL" category="DEPENDENCY_EDGE_LABEL" type="STRING"/>
    </dataformat>
    

    Then add the following to your MaltParser config file (you can find an example config in maltparser/examples/optionexample.xml):

    <?xml version="1.0" encoding="UTF-8"?>
    <experiment>
        <optioncontainer>
    ...
            <optiongroup groupname="input">
                <option name="format" value="myconll"/>
            </optiongroup>
        </optioncontainer>
    ...
    </experiment>
    

    Then you should be able to provide CoreNLP CoNLL output as training data to MaltParser.

    Untested, but if the MaltParser docs are accurate, this should work.


    Example CoreNLP CoNLL output (I only used annotators tokenize,ssplit,pos):

    $ echo "This is a test." | java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -outputFormat conll 2>/dev/null
    
    1    This    this    DT     _    _    _
    2    is      be      VBZ    _    _    _
    3    a       a       DT     _    _    _
    4    test    test    NN     _    _    _
    5    .       .       .      _    _    _
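
As an alternative to defining a custom data format, the 7-column CoreNLP rows could be rewritten into the 10-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The following is a minimal sketch, not tested against MaltParser itself. It assumes the CoreNLP output is tab-separated with exactly the seven columns listed above, it copies the POS tag into both CPOSTAG and POSTAG, and it drops the NER column. The function name `corenlp_to_conllx` is hypothetical.

```python
def corenlp_to_conllx(lines):
    """Rewrite CoreNLP 7-column CoNLL rows as 10-column CoNLL-X rows.

    Input columns (assumed):  INDEX WORD LEMMA POS NER DEPHEAD DEPREL
    Output columns (CoNLL-X): ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
    """
    rows = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line separates sentences in both formats; keep it.
            rows.append("")
            continue
        cols = line.split("\t")
        if len(cols) != 7:
            raise ValueError(
                f"expected 7 tab-separated columns, got {len(cols)}: {line!r}"
            )
        index, word, lemma, pos, _ner, head, deprel = cols
        # Assumption: reuse the fine-grained POS tag as the coarse tag,
        # and fill the columns CoreNLP does not provide with "_".
        rows.append(
            "\t".join([index, word, lemma, pos, pos, "_", head, deprel, "_", "_"])
        )
    return rows
```

To use it, pass any iterable of lines (for example an open file handle on the CoreNLP output) and write the returned rows to a new file, one per line.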