nlp information-extraction

What markup languages are typically used for annotating information extraction corpora?


I'm building a corpus for extracting specific types of information, and I'm trying to decide on the best way to annotate the entities. I have found that the IEER corpus uses the SGML elements ENAMEX, NUMEX, and TIMEX for this (as described here: http://itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html). Since that document was written in 1997, I'm guessing that this SGML-based approach is quite out of date and that there must be better ways of doing this, e.g. using OWL, RDF, or XML. Is there a more recent industry standard for annotating information extraction corpora?


Solution

  • I would say there isn't enough standardisation in the field, but it is also not clear that there needs to be a single format. My advice is to look at the options and choose the one that best fits your data and the information you are encoding.

    brat is the new classic for annotating language resources. It has its own standoff annotation standard. There is also the Anafora tool, which has its own XML-based standard. The UIMA-based tools usually use a CAS standard (though it is poorly documented). You should also look at the native GATE XML format.

    If the information you are encoding is simple enough, for example named entity types, you can even go for a tabular format such as CoNLL.

    If none of those fits your requirements, simply implement whatever does fit them.
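
To make the difference between a standoff format and a tabular one concrete, here is a minimal sketch that reads brat-style text-bound annotations and writes them out as CoNLL-style token/tag rows. It is only an illustration under some assumptions: the sample sentence, the labels ORG, LOC and DATE, the regex tokenizer, and the two-column BIO layout are all made up for the example, and only contiguous spans (the `T` lines of an `.ann` file) are handled.

```python
import re

# Hypothetical sample document and its brat-style standoff annotations.
# Each T line is: ID <TAB> TYPE START END <TAB> covered text,
# where START/END are character offsets and END is exclusive.
TEXT = "Sony opened an office in New York in 1997."
ANN = (
    "T1\tORG 0 4\tSony\n"
    "T2\tLOC 25 33\tNew York\n"
    "T3\tDATE 37 41\t1997\n"
)


def parse_ann(ann_text):
    """Return sorted (start, end, label) tuples for contiguous text-bound annotations."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # relations, events, attributes and notes are ignored here
        fields = line.split("\t")
        if ";" in fields[1]:
            continue  # discontinuous spans are out of scope for this sketch
        label, start, end = fields[1].split(" ")
        spans.append((int(start), int(end), label))
    return sorted(spans)


def to_bio(text, spans):
    """Tokenize naively (words and single punctuation marks) and assign BIO tags."""
    rows = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            # A token is tagged only if it lies entirely inside the span.
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        rows.append((match.group(), tag))
    return rows


rows = to_bio(TEXT, parse_ann(ANN))
# One token per line, tab-separated; a blank line would separate sentences.
print("\n".join(f"{token}\t{tag}" for token, tag in rows))
```

The standoff file leaves the source text untouched and can carry overlapping or nested spans, while the tabular form is easier to feed to sequence labellers but ties the annotation to one tokenization. That trade-off is usually what decides between the two.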