java string tokenize stringtokenizer large-data

String tokenization in Java (LARGE text)


I have this large text (read: LARGE). I need to tokenize every word, delimiting on every non-letter. I used StringTokenizer to read one word at a time. However, as I was researching how to write the delimiter string ("every non-letter") instead of doing something like:

new StringTokenizer(text, "\" ();,.'[]{}!?:”“…\n\r0123456789 [etc etc]");

I found that everyone basically hates StringTokenizer (why?).

So, what can I use instead? Don't suggest String.split, as it will duplicate my large text. I need to go through the text word by word and delimit on every non-letter. Is it easier to build something on my own, or is there a best-practice way to approach this problem?

Thanks in advance!


Solution

  • You can use the flexible string Splitter class from Google's Guava library.

    If you need something more powerful, have a look at StandardTokenizer from Apache Lucene. From the docs:

    This should be a good tokenizer for most European-language documents:

    • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
    • Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
    • Recognizes email addresses and internet hostnames as one token.
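
If pulling in a library is not an option, the "delimit on every non-letter" requirement can also be met with the JDK alone. What follows is only a minimal sketch, not something taken from the answer above: it assumes the text is already in memory as a CharSequence, and the class name WordScanner and the method forEachWord are made up for illustration. It scans with java.util.regex.Matcher and the pattern \p{L}+ (one or more Unicode letters), so only the current word is copied, never the whole text or an array of all tokens.

```java
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class WordScanner {
    // One or more Unicode letters. Anything else (digits, punctuation,
    // whitespace, curly quotes, ...) acts as a delimiter.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    private WordScanner() {}

    // Visits each word in order of appearance. Only the word currently
    // being visited is materialized as a new String.
    static void forEachWord(CharSequence text, Consumer<String> action) {
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            action.accept(m.group());
        }
    }
}
```

For example, WordScanner.forEachWord(text, System.out::println) would print one word per line. Note that an apostrophe counts as a non-letter here, so "don't" comes out as "don" and "t"; whether that is acceptable depends on the text. If the text lives in a file rather than in a String, a java.util.Scanner with useDelimiter("\\P{L}+") may be worth trying instead, since it reads from a stream.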