I use Stanford NLP for string tokenization in my classification tool. I want to get only meaningful words, but I also get non-word tokens (like `---`, `>`, `.`, etc.) and unimportant words such as `am`, `is`, `to` (stop words). Does anybody know a way to solve this problem?
This is a very domain-specific task that we don't perform for you in CoreNLP. You should be able to make this work with a regular expression filter and a stopword filter on top of the CoreNLP tokenizer.
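As a rough sketch of that two-step filter, the following assumes the tokens have already been produced by the CoreNLP tokenizer and are available as plain strings. The class name `TokenFilter`, the tiny stopword set, and the regular expression are all illustrative assumptions, not part of CoreNLP; a real application would load a fuller stopword list and adjust the pattern to its domain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper: post-filters tokens that were produced elsewhere
// (for example, by the CoreNLP tokenizer).
final class TokenFilter {

    // Illustrative stopword list only; a real list would be much longer.
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "i", "am", "is", "are", "to", "the", "a", "an", "of", "and"));

    // A "word" here is one or more letters, optionally joined by
    // apostrophes or hyphens (e.g. "don't", "well-known").
    // Tokens such as "---", ">" or "." do not match.
    private static final Pattern WORD =
            Pattern.compile("\\p{L}+(?:['-]\\p{L}+)*");

    // Returns the lowercased tokens that look like words and are not stopwords.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (WORD.matcher(lower).matches() && !STOPWORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "I", "am", "---", ">", ".", "happy", "to", "classify", "documents");
        System.out.println(filter(tokens)); // prints [happy, classify, documents]
    }
}
```

Lowercasing before the stopword lookup keeps the list short, but it also discards case information; if the classifier benefits from case, compare against the lowercased form and keep the original token instead.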