Tags: java, nlp, deserialization, parse-tree, clearnlp

How to deserialize a CoNLL format dependency tree with ClearNLP?


Dependency parsing with ClearNLP produces a DEPTree object. I have parsed a large corpus and serialized all of the trees in CoNLL format (see, e.g., this ClearNLP page on Google Code).

But I can't figure out how to read the serialized trees back in. ClearNLP provides a DEPTree#toStringCoNLL() method for writing (scroll down this page to see it), but I am looking for the reverse: something that reads a CoNLL-format parse tree and creates a DEPTree object. I tried to reverse-engineer toStringCoNLL(), but didn't really understand the inner workings of the code.

Instead, I have written my own dependency tree class that covers the basic functionality I need, but I would really like to get a DEPTree object. So far, I haven't found any method in the ClearNLP API that does this.
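
For reference, a hand-rolled fallback like the one described above might look roughly like this. It is only a sketch: SimpleDepNode and SimpleDepTree are hypothetical names, not ClearNLP classes, and it assumes a seven-column, tab-separated layout (ID, form, lemma, POS, feats, head ID, dependency label) with consecutive 1-based token IDs. Adjust the column indices to whatever your files actually contain.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal node type; this is NOT ClearNLP's DEPNode. */
final class SimpleDepNode {
    final int id;          // 1-based token position (0 = artificial root)
    final String form;     // surface word
    final int headId;      // ID of the head token, 0 for the sentence root
    final String deprel;   // dependency label
    final List<SimpleDepNode> dependents = new ArrayList<>();

    SimpleDepNode(int id, String form, int headId, String deprel) {
        this.id = id;
        this.form = form;
        this.headId = headId;
        this.deprel = deprel;
    }
}

/** Hypothetical minimal tree type; this is NOT ClearNLP's DEPTree. */
final class SimpleDepTree {
    /** Index 0 holds an artificial root; token i is stored at index i. */
    final List<SimpleDepNode> nodes = new ArrayList<>();

    /** Parses one sentence block (one token per line, tab-separated columns). */
    static SimpleDepTree fromCoNLL(String block) {
        SimpleDepTree tree = new SimpleDepTree();
        tree.nodes.add(new SimpleDepNode(0, "<root>", -1, null));

        // First pass: create one node per non-empty line.
        for (String line : block.split("\\R")) {
            if (line.trim().isEmpty()) continue;
            String[] f = line.split("\t");
            if (f.length < 7)
                throw new IllegalArgumentException("Expected at least 7 columns: " + line);
            // Assumed columns: 0 = ID, 1 = form, 5 = head ID, 6 = dependency label.
            tree.nodes.add(new SimpleDepNode(
                    Integer.parseInt(f[0]), f[1], Integer.parseInt(f[5]), f[6]));
        }

        // Second pass: attach each token to its head (relies on consecutive IDs).
        for (int i = 1; i < tree.nodes.size(); i++) {
            SimpleDepNode node = tree.nodes.get(i);
            tree.nodes.get(node.headId).dependents.add(node);
        }
        return tree;
    }
}
```

Calling SimpleDepTree.fromCoNLL(...) on a single sentence block gives a list of nodes whose dependents lists mirror the head column. This covers basic traversal, but none of the feature handling or utilities that a real DEPTree provides.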


Solution

  • Found the answer, so sharing the wisdom on SO :-) ...

    Deserialization can be done with the TSVReader class in the edu.emory.clir.clearnlp.reader package:

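    // Requires java.io.FileInputStream plus ClearNLP's TSVReader, DEPTree and DEPNode.
    public void readCoNLL(String inputFile) throws Exception {
        // 0-based column indices; these appear to map to ID, form, lemma, POS tag,
        // features, head ID and dependency label (column 3 is skipped).
        TSVReader reader = new TSVReader(0, 1, 2, 4, 5, 6, 7);
        // try-with-resources closes the file even if reading fails part-way
        try (FileInputStream in = new FileInputStream(inputFile)) {
            reader.open(in);
            DEPTree tree;
            // loop until next() returns null, i.e. no more trees in the input
            while ((tree = reader.next()) != null)
                System.out.println(tree.toString(DEPNode::toStringDEP));
        }
    }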
    

    This snippet was provided here by Jinho Choi, the author of ClearNLP.

    In older versions (< 3.x), you will need to use the com.clearnlp.reader.DEPReader class instead of TSVReader.
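
To make the index list 0, 1, 2, 4, 5, 6, 7 less mysterious, here is a small library-free sketch of what picking those columns out of a row does. It rests on two assumptions that the answer itself does not confirm: that the seven numbers are 0-based column indices, and that the file uses an eight-column CoNLL-X-style layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL). ColumnSelectDemo and select are hypothetical names used only for illustration.

```java
import java.util.Arrays;

final class ColumnSelectDemo {
    /** Returns the fields of a tab-separated row at the given 0-based column indices. */
    static String[] select(String row, int... columns) {
        String[] fields = row.split("\t");
        String[] picked = new String[columns.length];
        for (int i = 0; i < columns.length; i++)
            picked[i] = fields[columns[i]];
        return picked;
    }

    public static void main(String[] args) {
        // Assumed CoNLL-X-style row: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL
        String row = "2\tsleeps\tsleep\tV\tVBZ\t_\t0\troot";
        // Skipping index 3 drops the coarse tag (CPOSTAG) and keeps the fine-grained one.
        System.out.println(Arrays.toString(select(row, 0, 1, 2, 4, 5, 6, 7)));
        // prints [2, sleeps, sleep, VBZ, _, 0, root]
    }
}
```

If your serialized files have a different column order, the indices passed to the reader would presumably need to change to match.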