java machine-learning tokenize stanford-nlp

Text tokenization with Stanford NLP: Filter unwanted words and characters


I use Stanford NLP to tokenize strings in my classification tool. I want to keep only meaningful words, but the output also contains non-word tokens (such as ---, >, and .) and unimportant words such as am, is, and to (stop words). Does anybody know a way to solve this problem?


Solution

  • This is a very domain-specific task that we don't perform for you in CoreNLP. You should be able to make this work with a regular expression filter and a stopword filter on top of the CoreNLP tokenizer.

    Here's an example list of English stopwords.
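
As a rough sketch of the two filters described above, the following assumes the tokens have already been extracted as plain strings (for instance by calling `word()` on each `CoreLabel` the CoreNLP tokenizer returns). The class name `TokenFilter`, the tiny stopword set, and the word pattern are illustrative assumptions, not part of CoreNLP; a real stopword list would be loaded from a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper: filters a list of token strings produced by any tokenizer.
class TokenFilter {
    // Tiny placeholder stopword set (assumption); replace with a full list.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "a", "am", "an", "and", "are", "i", "is", "the", "to"));

    // A "word" here is one or more letters, optionally joined by
    // internal apostrophes or hyphens (e.g. "don't", "well-known").
    static final Pattern WORD = Pattern.compile("\\p{L}+(?:['-]\\p{L}+)*");

    // Returns the lowercased tokens that look like words and are not stopwords.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (WORD.matcher(lower).matches() && !STOPWORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "I", "am", "going", "to", "---", ">", "Paris", ".");
        System.out.println(filter(tokens)); // prints [going, paris]
    }
}
```

Lowercasing before the lookup keeps the stopword set small. Whether tokens containing digits or mixed punctuation should count as words depends on the domain, so the pattern is the part most likely to need adjusting.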