Tags: java, regex, java.util.scanner, text-formatting

use of delimiter function from scanner for "abc-def"


I'm currently trying to filter a text-file which contains words that are separated with a "-". I want to count the words.

scanner.useDelimiter(("[.,:;()?!\" \t\n\r]+"));

The problem is simply this: words that contain a "-" get split and counted as two words, so just adding an escaped \- to the delimiter set isn't a workable solution.

How can I change the delimiter-expression, so that words like "foo-bar" will stay, but the "-" alone will be filtered out and ignored?

Thanks ;)


Solution

  • OK, I'm guessing at your question here: you mean that you have a text file with some "real" prose, i.e. sentences that actually make sense, are separated by punctuation and the like, etc., right?

    Example:

    This situation is ameliorated - as far as we can tell - by the fact that our most trusted allies, the Vorgons, continue to hold their poetry slam contests; the enemy has little incentive to interfere with that, even with their Mute-O-Matic devices.

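    So what you need as a delimiter is either any run of whitespace and/or punctuation (which the regex you showed already covers), or a hyphen with at least one whitespace character on each side. The regex operator for "or" is "|". Many regex implementations, Java's included, also have a shorthand for the whitespace character class (spaces, tabs, newlines and the like): "\s".

    "\\s+-\\s+|[.,:;()?!\"\\s]+"

    The backslashes are doubled because this is a Java string literal. The hyphen alternative comes first because regex alternation takes the first branch that matches: with the character class in front, the space before a lone "-" would be consumed as a delimiter on its own, and the "-" would still come out as a token.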