java normalization wordbreaker

Java tokenizer or word breaker for different languages


I wonder if there is a Java-based language utility that can tokenize (word-break) a string and remove the noise words.

So for a string

Friday's meeting is wonderful

expected result will be a series of words

Friday meeting wonderful

where the 's and "is" were removed

And for string

I went to the farmer's market 

expected result will be words

went farmer market

where "I", "to", "the", and the 's were removed


Solution

  • There is no general solution to this problem, not least because your notion of "noise" is ill-defined, and most likely different from other people's.

    If I were implementing this (and I agreed with your notion of "noise"), I would:

    1. Tokenize using whitespace and accepted punctuation as delimiters.
    2. Strip quotes.
    3. Strip apostrophes.
    4. Normalize hyphenation (maybe just remove the hyphens).
    5. Use a stop-word filter to get rid of the "noise" words.

    In short, you are going to have to write a non-trivial amount of code to do this.
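
    As a rough illustration of the five steps, here is a minimal sketch. The class name `NoiseFilter`, the method `tokenize`, and the tiny English stop-word list are all made up for this example; a real list would be much larger and different for each language. The sketch also assumes that "strip apostrophes" means dropping a trailing possessive 's first, which is what the two examples in the question expect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class NoiseFilter {

    // Hypothetical, deliberately tiny English stop-word list (an assumption,
    // not a standard list). Requires Java 9+ for Set.of.
    private static final Set<String> STOP_WORDS =
            Set.of("i", "a", "an", "the", "is", "are", "was", "to", "of", "and");

    public static List<String> tokenize(String text) {
        List<String> result = new ArrayList<>();
        // Step 1: split on whitespace and some common punctuation.
        // Apostrophes and hyphens are left in place for the later steps.
        for (String raw : text.split("[\\s.,;:!?()\\[\\]]+")) {
            // Step 2: strip double quotes.
            String token = raw.replace("\"", "");
            // Step 3: drop a trailing possessive 's, then any other apostrophes.
            token = token.replaceAll("'[sS]$", "");
            token = token.replace("'", "");
            // Step 4: normalize hyphenation by simply removing hyphens.
            token = token.replace("-", "");
            // Step 5: filter out stop words (case-insensitively) and empty tokens.
            if (!token.isEmpty()
                    && !STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Friday's meeting is wonderful"));
        System.out.println(tokenize("I went to the farmer's market "));
    }
}
```

    Note that this is lossy in exactly the way you would expect: "it's" becomes "it", "well-known" becomes "wellknown", and anything not on the stop-word list is kept. It also only handles text where words are separated by spaces and ASCII punctuation, so it does not address the "different languages" part of the question.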


    Of course, stripping the "noise" words strips information that is relevant to a proper semantic analysis of the text. ("I hit the ball" and "You hit a ball" are saying different things.)