I wonder if there is a Java-based language utility out there that can help with the following: string tokenizing (word breaking) and noise removal.
So for a string
Friday's meeting is wonderful
expected result will be a series of words
Friday meeting wonderful
where the "'s" and "is" have been removed
And for string
I went to the farmer's market
expected result will be words
went farmer market
where "I", "to", "the", and "'s" have been removed
There is no general solution to this problem, not least because your notion of "noise" is ill-defined, and most likely different from other people's.
If I were implementing this (and I agreed with your notion of "noise") I would:
In short, you are going to have to write a non-trivial amount of code to do this.
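To give a rough idea of the kind of code involved, here is a minimal sketch, not a definitive implementation. It assumes that "noise" means a fixed list of words plus a trailing possessive "'s"; the class name NoiseFilter, the method filter, and the contents of NOISE_WORDS are all hypothetical and chosen only to reproduce the two examples in the question. It uses only the standard library (String.split with a regex) and needs Java 9 or later for Set.of.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class NoiseFilter {
    // Hypothetical noise list; adjust it to your own definition of "noise".
    private static final Set<String> NOISE_WORDS =
            Set.of("i", "is", "to", "the", "a", "an", "and");

    static List<String> filter(String text) {
        List<String> result = new ArrayList<>();
        // Split on runs of characters that are neither letters nor apostrophes.
        for (String token : text.split("[^\\p{L}']+")) {
            // Drop a trailing possessive "'s", then any stray apostrophes at the edges.
            String word = token.replaceAll("'[sS]$", "").replaceAll("^'+|'+$", "");
            // Compare case-insensitively against the noise list; skip empty tokens.
            if (!word.isEmpty() && !NOISE_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                result.add(word);
            }
        }
        return result;
    }
}
```

Calling NoiseFilter.filter("I went to the farmer's market") returns the words went, farmer, market. Note the limitations: the sketch handles only the straight ASCII apostrophe, and it cannot tell a possessive from a contraction, so "it's" becomes "it".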
Of course, stripping the "noise" words also strips information that is relevant to a proper semantic analysis of the text. ("I hit the ball" and "You hit a ball" are saying different things.)