java nlp opennlp text-segmentation

Sentence detection using NLP


I am trying to parse sentences out of a huge amount of text using Java. I started off with NLP tools like OpenNLP and the Stanford Parser.

But here is where I get stuck. Though both of these parsers are pretty great, they fail when it comes to non-uniform text.

For example, in my text most sentences are delimited by a period, but in some cases, like bullet points, they aren't. Here both parsers fail miserably.

I even tried setting the option in the Stanford parser for multiple sentence terminators, but the output was not much better!

Any ideas?

Edit: To make it simpler, I am looking to parse text where the delimiter is either a newline ("\n") or a period (".") ...


Solution

  • First, you have to clearly define the task. What, precisely, is your definition of "a sentence"? Until you have such a definition, you will just wander in circles.

    Second, cleaning dirty text is usually a rather different task from sentence splitting. The various NLP sentence chunkers assume relatively clean input text. Getting from HTML, extracted PowerPoint, or other noisy sources to plain text is a separate problem.

    Third, Stanford and other heavy-caliber tools are statistical, so they are guaranteed to have a non-zero error rate. The less your data looks like what they were trained on, the higher the error rate.
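
If the working definition really is the one from the question's edit (a sentence ends at a newline or at a period), a rule-based splitter may be enough before reaching for a statistical tool. The following is only a minimal sketch under that assumption. The class name `SimpleSentenceSplitter` and its `split` method are made up for illustration and are not part of OpenNLP or the Stanford tools. It treats a period as a terminator only when whitespace follows it, so a number like "3.14" stays intact, but abbreviations such as "Dr. Smith" will still be split incorrectly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from any NLP library.
// Assumed definition: a sentence ends at a line break,
// or at a period that is followed by whitespace.
final class SimpleSentenceSplitter {

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        // \R+           : one or more line breaks (Java 8+)
        // (?<=\.)\s+    : whitespace that directly follows a period
        for (String piece : text.split("\\R+|(?<=\\.)\\s+")) {
            String sentence = piece.trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "First sentence. Second sentence.\n"
                    + "- bullet one\n"
                    + "- bullet two\n"
                    + "Last one.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

Each bullet line comes out as its own "sentence" because the newline is treated as a hard boundary. Whether that is right depends entirely on the definition you settle on in the first step, and hard-wrapped paragraphs would need to be unwrapped during the cleaning step before a rule like this is usable.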