I have this large text (read LARGE). I need to tokenize every word, delimit on every non-letter. I used StringTokenizer to read one word at a time. However, as I was researching how to write the delimiter string ("every non-letter") instead of doing something like:
new StringTokenizer(text, "\" ();,.'[]{}!?:”“…\n\r0123456789 [etc etc]");
I found that everyone basically hates StringTokenizer (why?).
So, what can I use instead? Dont suggest String.split as it will duplicate my large text. I need to go through the text word by word and delimit on every non-letter. Is it easier to build something on my own or is there some best practice way to confront this problem?
Thanks in advance!
You can use the flexible string Splitter class from Google's guava library.
If you need something more powerful, have a look at StandardTokenizer from Apache Lucene. From the docs:
This should be a good tokenizer for most European-language documents:
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.