For the task of sentiment analysis on a text, I am using the following annotators to create a pipeline:
annotators = tokenize, ssplit, parse, sentiment
After reading the documentation on annotators, I realized that tokenize and ssplit take the whole text and break it up into separate sentences to be considered for further parsing. The problem I am currently working on is sentiment analysis of tweets. Since tweets rarely exceed a single line, running the tokenize and ssplit annotators before parse seems like overkill.
I tried to exclude the first two, but the pipeline refuses to build and throws: Exception in thread "main" java.lang.IllegalArgumentException: annotator "parse" requires annotator "tokenize"
Is there any way to avoid using the tokenize and ssplit annotators to improve efficiency?
Yes, if your text is already tokenized and you have a file with one sentence per line, you can tell the tokenizer to split tokens only at spaces and the sentence splitter to split sentences only at newlines.
The option for the tokenizer is -tokenize.whitespace true, and the option for the sentence splitter is -ssplit.eolonly true.
You can find more information on the options of the tokenizer and the sentence splitter in the CoreNLP documentation.
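As a minimal sketch, the two options can be set programmatically on the Properties object you pass to the pipeline. The class and method names below (TweetPipelineConfig, tweetProps) are illustrative, not part of CoreNLP; only the annotator list and the tokenize.whitespace / ssplit.eolonly property names come from the discussion above.

```java
import java.util.Properties;

public class TweetPipelineConfig {
    // Build the pipeline configuration for pre-tokenized, one-tweet-per-line input.
    public static Properties tweetProps() {
        Properties props = new Properties();
        // tokenize and ssplit must stay in the list (parse requires them),
        // but the two options below make them near no-ops.
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
        // Split tokens only at whitespace: assumes the text is already tokenized.
        props.setProperty("tokenize.whitespace", "true");
        // Treat each newline as a sentence boundary: one tweet per line.
        props.setProperty("ssplit.eolonly", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties p = tweetProps();
        System.out.println(p.getProperty("tokenize.whitespace")); // prints: true
    }
}
```

You would then pass these properties to the pipeline constructor as usual, e.g. `new StanfordCoreNLP(props)`, and the tokenizer and sentence splitter do only the trivial whitespace/newline work instead of full rule-based segmentation.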