Tags: parsing, nlp, artificial-intelligence, semantics, stanford-nlp

Sentence boundary detection in noisy or ASR data


There are many tools and papers available that perform this task using basic sentence separators.

Such tools include

  1. http://nlp.stanford.edu/software/tokenizer.shtml

  2. OpenNLP

  3. NLTK

and there might be others. They mainly rely on rules such as

(a) If it's a period, it ends a sentence.
(b) If the preceding token is on my hand-compiled list of abbreviations, then it doesn't end a sentence.
(c) If the next token is capitalized, then it ends a sentence. 
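
To make the three rules concrete, here is a minimal sketch of such a splitter. It is not how any of the tools above is actually implemented; the function name `split_sentences` and the tiny `ABBREVIATIONS` set are made up for illustration.

```python
# Hypothetical, deliberately tiny abbreviation list; real tools use much larger ones.
ABBREVIATIONS = {"mr", "mrs", "dr", "prof", "st", "etc"}


def split_sentences(text):
    """Split text on periods using rules (a), (b) and (c) above."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue  # rule (a): only a period can end a sentence
        if token[:-1].lower() in ABBREVIATIONS:
            continue  # rule (b): a period after a known abbreviation is not a boundary
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if next_token is None or next_token[0].isupper():
            # rule (c): a capitalized next token (or end of input) confirms the boundary
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))  # trailing text without a final period
    return sentences


print(split_sentences("Mr. Smith is a doctor. He works in NW."))
print(split_sentences("John is actor and his father Mr Smith was top city doctor in NW"))
```

The first call splits into two sentences. The second, which has no punctuation at all, comes back as a single sentence, and that is exactly the problem with ASR-style input.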

There are a few papers that suggest techniques for SBD in ASR text:

http://pdf.aminer.org/000/041/703/experiments_on_sentence_boundary_detection.pdf

http://www.icsd.aegean.gr/lecturers/kavallieratou/publications_files/icpr_2000.pdf

Are there any tools that can perform sentence detection on ambiguous sentences like

  1. John is actor and his father Mr Smith was top city doctor in NW (2 sentences)

  2. Where is statue of liberty, what is it's height and what is the history behind? (3 sentences)


Solution

  • What you are seeking to do is to identify the independent clauses in a compound sentence. A compound sentence is a sentence with at least two independent clauses joined by a coordinating conjunction. There is no readily available tool for this, but you can identify compound sentences with a high degree of precision by using constituency parse trees.

    Be wary, though. Slight grammatical mistakes can yield a very wrong parse tree! For example, if you use the Berkeley parser (demo page: http://tomato.banatao.berkeley.edu:8080/parser/parser.html) on your first example, the parse tree is not what you would expect, but if you correct it to "John is an actor and his father ... ", you can see the parse tree neatly divided into the structure S CC S:

    [Image: The Berkeley Parser's output on the first sentence]

    Now, you simply take each subtree labeled S under the top-level S as an independent clause!

    Questions are not handled well, I am afraid, as you can check with your second example.
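
The S CC S splitting step described above can be sketched as follows, assuming you already have the parse as a Penn Treebank-style bracketed string. The `PARSE` string below is hand-written for a shortened version of the first example and is not actual Berkeley Parser output; the helper names `parse_tree`, `leaves` and `independent_clauses` are made up for illustration.

```python
def parse_tree(bracketed):
    """Parse a bracketed string into nested (label, children) tuples; leaves are str."""
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word at a leaf
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


def independent_clauses(tree):
    """If the top-level S has the shape S CC S, return one string per S child."""
    label, children = tree
    # Skip a wrapping ROOT node if the parser emits one.
    while label == "ROOT" and len(children) == 1 and not isinstance(children[0], str):
        label, children = children[0]
    subtrees = [c for c in children if not isinstance(c, str)]
    clauses = [c for c in subtrees if c[0] == "S"]
    has_conjunction = any(c[0] == "CC" for c in subtrees)
    if label == "S" and has_conjunction and len(clauses) >= 2:
        return [" ".join(leaves(c)) for c in clauses]
    return [" ".join(leaves((label, children)))]  # not a compound sentence


# Hand-written illustrative parse, not real parser output.
PARSE = (
    "(ROOT (S"
    " (S (NP (NNP John)) (VP (VBZ is) (NP (DT an) (NN actor))))"
    " (CC and)"
    " (S (NP (PRP$ his) (NN father) (NNP Mr) (NNP Smith))"
    " (VP (VBD was) (NP (DT a) (JJ top) (NN city) (NN doctor))))"
    "))"
)

print(independent_clauses(parse_tree(PARSE)))
```

This only looks at the children of the top-level S, so it relies on the parser getting that structure right. As noted above, a small grammatical slip in the input can break that assumption.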