java tokenize opennlp

Tokenize words ignoring hashtags with OpenNLP


I'm trying to tokenize some sentences. For example, the sentence:

String sentence = "The sky is blue. A cat is #blue.";

I use the following code with OpenNLP:

SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
String[] result = tokenizer.tokenize(sentence);

But I want OpenNLP to treat '#' as a letter of a word, so that '#blue' is a single token.

How can I do this?


Solution

  • You just have to create your own Tokenizer object (one implementing the Tokenizer interface).

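    Tokenizer t = new Tokenizer() {

        @Override
        public Span[] tokenizePos(String s) {
            // Paste the body of SimpleTokenizer.tokenizePos here (it must return a Span[]).
        }

        @Override
        public String[] tokenize(String s) {
            // Paste the body of SimpleTokenizer.tokenize here (it must return a String[]).
        }
    };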
    

    Then copy and paste the SimpleTokenizer code into those two methods.

    Then make '#' count as an alphabetic character, alongside the letters:

    if (StringUtil.isWhitespace(c)) {
        charType = CharacterEnum.WHITESPACE;
    } else if (Character.isLetter(c) || c=='#') {
        charType = CharacterEnum.ALPHABETIC;
    } else if (Character.isDigit(c)) {
        charType = CharacterEnum.NUMERIC;
    } else {
        charType = CharacterEnum.OTHER;
    }
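
To see the effect of that one-character change without pulling in the library, here is a minimal standalone sketch of the same character-class state machine. It is not OpenNLP code: the class name `HashtagTokenizer`, the `CharType` enum and the `classify` helper are made up for illustration, and `Character.isWhitespace` is used as a stand-in for `StringUtil.isWhitespace`. It returns plain strings rather than `Span` objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone class; it only imitates the approach described above.
final class HashtagTokenizer {

    // Character classes, analogous to the CharacterEnum values in the answer.
    private enum CharType { WHITESPACE, ALPHABETIC, NUMERIC, OTHER }

    private static CharType classify(char c) {
        if (Character.isWhitespace(c)) {          // stand-in for StringUtil.isWhitespace
            return CharType.WHITESPACE;
        } else if (Character.isLetter(c) || c == '#') {
            return CharType.ALPHABETIC;           // '#' is grouped with letters
        } else if (Character.isDigit(c)) {
            return CharType.NUMERIC;
        }
        return CharType.OTHER;
    }

    static String[] tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        CharType state = CharType.WHITESPACE;     // class of the previous character
        int start = -1;                           // start index of the current token
        char prev = 0;

        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            CharType type = classify(c);
            if (state == CharType.WHITESPACE) {
                // Leaving whitespace: a new token begins here.
                if (type != CharType.WHITESPACE) {
                    start = i;
                }
            } else if (type != state || (type == CharType.OTHER && c != prev)) {
                // The character class changed (or a different symbol appeared): close the token.
                tokens.add(s.substring(start, i));
                start = i;
            }
            state = type;
            prev = c;
        }
        // Flush the last token if the input does not end in whitespace.
        if (state != CharType.WHITESPACE) {
            tokens.add(s.substring(start));
        }
        return tokens.toArray(new String[0]);
    }
}
```

Calling `HashtagTokenizer.tokenize("The sky is blue. A cat is #blue.")` yields `#blue` as a single token, while the periods are still split off. One consequence of grouping '#' with letters is that a '#' in the middle or at the end of a word (for example `a#b`) also stays attached to it.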