How to subdivide documents into sentences before training MALLET LDA

Does anyone have a suggestion for how I could subdivide documents into sentences before training MALLET LDA?

Thank you in advance

Solution

You can, for instance, use the OpenNLP Sentence Detection Tools. They have been around for a while and perform decently in most cases.

The documentation is here, and the models can be downloaded here. Note that the version 1.5 models are fully compatible with the newer opennlp-tools version 1.8.4.

If you are using Maven, add the following dependency to your pom.xml:

<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>1.8.4</version>
</dependency>
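With the dependency in place, sentence detection could look roughly like the sketch below. It uses the `SentenceModel` and `SentenceDetectorME` classes from opennlp-tools. The model file name `en-sent.bin`, its location in the working directory, and the sample text are assumptions made for illustration; adjust them to whichever model you downloaded.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceSplitter {
    public static void main(String[] args) throws IOException {
        // Assumption: the pre-trained English sentence model was saved
        // as "en-sent.bin" in the current working directory.
        try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
            SentenceModel model = new SentenceModel(modelIn);
            SentenceDetectorME detector = new SentenceDetectorME(model);

            // Hypothetical sample text; replace with the content of one document.
            String document = "MALLET is a Java toolkit. It includes LDA. "
                    + "Each sentence could become one training instance.";

            // sentDetect returns one string per detected sentence.
            String[] sentences = detector.sentDetect(document);
            for (String sentence : sentences) {
                System.out.println(sentence);
            }
        }
    }
}
```

Each returned sentence can then be written out as its own instance for the topic model. Loading the model is comparatively expensive, so it is probably worth creating the detector once and reusing it for all documents.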

If you plan to switch the model input from documents to sentences, be aware that vanilla LDA may not produce satisfactory results (as far as I know, this also applies to the current implementation in MALLET). A single sentence contains too few words for word co-occurrence counts to be informative.

I would suggest investigating whether the paragraph level works better. Paragraphs can be extracted with line-break patterns: for instance, a new paragraph starts after two consecutive line breaks.
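The two-line-break rule can be sketched with a regular expression from the Java standard library. This is only a minimal illustration: the class and method names are hypothetical, and it additionally assumes that blank lines containing only spaces or tabs should count as paragraph breaks and that both Unix and Windows line endings may occur.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ParagraphSplitter {
    // Two or more consecutive line breaks; the "blank" lines in between
    // may contain spaces or tabs (an assumption, not part of the basic rule).
    private static final Pattern PARAGRAPH_BREAK =
            Pattern.compile("(?:\\r?\\n[ \\t]*){2,}");

    // Splits a document into trimmed, non-empty paragraphs.
    static List<String> splitParagraphs(String document) {
        List<String> paragraphs = new ArrayList<>();
        for (String chunk : PARAGRAPH_BREAK.split(document.trim())) {
            String paragraph = chunk.trim();
            if (!paragraph.isEmpty()) {
                paragraphs.add(paragraph);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String doc = "First paragraph, line one.\nFirst paragraph, line two.\n\n"
                + "Second paragraph.";
        for (String paragraph : splitParagraphs(doc)) {
            System.out.println("[" + paragraph + "]");
        }
    }
}
```

A single line break stays inside its paragraph, so you may want to replace the remaining line breaks with spaces before passing each paragraph to the topic model as one instance.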