Is there a way to get the "original" text data for OpenNLP?

I know that this question has been asked before, but the answer was not satisfying (it was just a link).

So my question is: is there any way to extend the existing OpenNLP models? I already know about the technique with DBPedia/Wikipedia. But what if I just want to append some lines of text to improve the models - is there really no way? (If so, that would be really stupid...)

Solution

Unfortunately, you can't. See this question, which has a detailed answer to the same problem.

I think this is a tough problem, because when you deal with texts you often run into licensing issues. For example, you cannot build a corpus from Twitter data and publish it to the community (see this paper for some more information).

Therefore, companies often build domain-specific corpora and use them internally. We did the same in our research project, and to do so we built a tool (Quick Pad Tagger) for creating annotated corpora efficiently (see here).
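
If you end up building your own corpus, the OpenNLP name finder trainer reads plain text with one tokenized sentence per line and entities marked inline as `<START:type> ... <END>`. Here is a minimal sketch of producing such lines, assuming you already have tokenized sentences and entity spans. The function name `to_opennlp_line` and the `(start, end, type)` span layout are my own illustration, not part of OpenNLP, and overlapping spans are not handled:

```python
def to_opennlp_line(tokens, spans):
    """Render one sentence in OpenNLP name finder training format.

    tokens: list of token strings for one sentence.
    spans:  list of (start, end, type) tuples, token indices, end exclusive.
            (Hypothetical layout chosen for this sketch.)
    """
    starts = {start: label for start, _end, label in spans}
    ends = {end for _start, end, _label in spans}

    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            # Close the entity that finished just before this token.
            out.append("<END>")
        if i in starts:
            out.append("<START:" + starts[i] + ">")
        out.append(token)
    if len(tokens) in ends:
        # Entity runs to the end of the sentence.
        out.append("<END>")
    return " ".join(out)


if __name__ == "__main__":
    sentence = ["Pierre", "Vinken", ",", "61", "years", "old"]
    print(to_opennlp_line(sentence, [(0, 2, "person")]))
```

Writing one such line per sentence to a file gives you training data for a model of your own. That model replaces the stock model rather than extending it.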