I'm trying to tokenize some sentences, for example:
String sentence = "The sky is blue. A cat is #blue.";
I use the following code with OpenNLP:
SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
String[] result = tokenizer.tokenize(sentence);
But I want OpenNLP to treat '#' as a letter within a word, so that '#blue' becomes a single token.
How can I do this?
You just have to create your own Tokenizer object (an implementation of the Tokenizer interface).
Tokenizer t = new Tokenizer() {
    @Override
    public Span[] tokenizePos(String arg0) {
        // body of SimpleTokenizer.tokenizePos goes here
    }
    @Override
    public String[] tokenize(String arg0) {
        // body of SimpleTokenizer.tokenize goes here
    }
};
Then copy and paste the SimpleTokenizer code into those two methods.
And classify '#' together with the alphabetic characters:
if (StringUtil.isWhitespace(c)) {
    charType = CharacterEnum.WHITESPACE;
} else if (Character.isLetter(c) || c=='#') {
    charType = CharacterEnum.ALPHABETIC;
} else if (Character.isDigit(c)) {
    charType = CharacterEnum.NUMERIC;
} else {
    charType = CharacterEnum.OTHER;
}
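To see the effect of that one-character change without pulling in OpenNLP, here is a minimal standalone sketch of the same character-class approach. It is not the OpenNLP source: the class name HashTagTokenizer and the CharType enum are made up for illustration, Character.isWhitespace stands in for StringUtil.isWhitespace, and it returns plain strings instead of Span objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone tokenizer (not part of OpenNLP) that splits text
// whenever the character class changes, and counts '#' as a letter.
final class HashTagTokenizer {

    // Illustrative stand-in for the CharacterEnum used in the snippet above.
    enum CharType { WHITESPACE, ALPHABETIC, NUMERIC, OTHER }

    static CharType classify(char c) {
        if (Character.isWhitespace(c)) return CharType.WHITESPACE;
        if (Character.isLetter(c) || c == '#') return CharType.ALPHABETIC; // '#' treated as a letter
        if (Character.isDigit(c)) return CharType.NUMERIC;
        return CharType.OTHER;
    }

    static String[] tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        CharType state = CharType.WHITESPACE; // class of the previous character
        int start = -1;                       // start index of the current token
        char prev = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            CharType type = classify(c);
            if (state == CharType.WHITESPACE) {
                // A token can only begin on a non-whitespace character.
                if (type != CharType.WHITESPACE) start = i;
            } else if (type != state || (type == CharType.OTHER && c != prev)) {
                // The class changed (or a different symbol follows): close the token.
                tokens.add(s.substring(start, i));
                start = i;
            }
            state = type;
            prev = c;
        }
        // Flush the last token if the input does not end in whitespace.
        if (state != CharType.WHITESPACE) tokens.add(s.substring(start));
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String token : tokenize("The sky is blue. A cat is #blue.")) {
            System.out.println(token);
        }
    }
}
```

With the sentence from the question, '#blue' comes out as one token while the periods are still split off. Note that a '#' placed in the middle or at the end of a word (for example 'a#b') would also stay attached to it, which may or may not be what you want.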