Ignore text inside XML elements when parsing text with Stanford CoreNLP


I'd like to use Stanford CoreNLP to analyze the text content of XML files.

Here's an example of the kind of XML element I'm analyzing:

<cmd>In the new plug-in directory, add a <filepath>cfg/catalog.xml</filepath> file that specifies the custom XSLT style sheets.</cmd>

One thing I'd like to check is whether a <cmd> element contains more than one sentence. Now, if I extract the text content of the <cmd> element above, the result is this:

In the new plug-in directory, add a cfg/catalog.xml file that specifies the custom XSLT style sheets.

If I give that piece of text to Stanford CoreNLP, it reports two sentences because of the dot in cfg/catalog.xml, even though the text is really just one sentence.

In this particular example, I could probably just omit the content of the <filepath> element when analyzing the text and it'd work well enough, but that's not necessarily always the case.

Any suggestions on how to best approach this problem on a general level? I guess I'm looking for a way to either ignore the content of <filepath> and similar elements for certain purposes or somehow force them to be recognized as named entities, if that makes any sense.


Solution

  • You could build an annotator that temporarily replaces the problematic content (the text of <filepath> and similar elements, such as file names) with placeholder tokens, then restores the original text after sentence splitting.

    If I get a chance I'll write up some example code.
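
Until then, here is a minimal sketch of the replace-and-restore idea, written in plain Python rather than as an actual CoreNLP annotator. Everything named here is an assumption made for illustration: `MASKED_TAGS`, `mask_element`, `split_sentences` and the `PLACEHOLDER<n>` token format are hypothetical, and `naive_split` is only a deliberately crude stand-in for the real sentence splitter. In practice you would pass in a function that calls CoreNLP and returns the sentence strings.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical set of inline elements whose text is hidden from the splitter.
MASKED_TAGS = {"filepath"}
PLACEHOLDER_RE = re.compile(r"PLACEHOLDER\d+")


def mask_element(elem, masked_tags=MASKED_TAGS):
    """Return (masked_text, mapping) for the text content of elem.

    The text of every descendant whose tag is in masked_tags is replaced
    by a dot-free placeholder token; mapping maps tokens to original text.
    """
    mapping = {}
    parts = []

    def walk(node, is_root):
        if not is_root and node.tag in masked_tags:
            key = f"PLACEHOLDER{len(mapping)}"
            mapping[key] = "".join(node.itertext())
            parts.append(key)
        else:
            parts.append(node.text or "")
            for child in node:
                walk(child, False)
        if not is_root:
            # Text that follows the element inside its parent.
            parts.append(node.tail or "")

    walk(elem, True)
    return "".join(parts), mapping


def split_sentences(elem, splitter, masked_tags=MASKED_TAGS):
    """Mask, split with the given splitter, then restore the original text."""
    masked, mapping = mask_element(elem, masked_tags)

    def restore(sentence):
        return PLACEHOLDER_RE.sub(
            lambda m: mapping.get(m.group(0), m.group(0)), sentence
        )

    return [restore(s) for s in splitter(masked)]


def naive_split(text):
    """Stand-in splitter (NOT CoreNLP): ends a sentence at every . ! or ?"""
    chunks = re.findall(r"[^.!?]+[.!?]?", text)
    return [c.strip() for c in chunks if c.strip()]


if __name__ == "__main__":
    cmd = ET.fromstring(
        "<cmd>In the new plug-in directory, add a "
        "<filepath>cfg/catalog.xml</filepath> file that specifies "
        "the custom XSLT style sheets.</cmd>"
    )
    print(len(naive_split("".join(cmd.itertext()))))  # unmasked text: 2
    print(len(split_sentences(cmd, naive_split)))     # masked text: 1
```

Because the placeholder contains no punctuation, the splitter has no dot to trip over, and the restored sentence is identical to the original text content. Whether a real tokenizer keeps a token like `PLACEHOLDER0` intact is something to verify against your CoreNLP setup; if it does not, choose a different token format.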