I am practicing MapReduce with the Cloudera tutorial here. However, the tutorial currently only splits words on whitespace, using this regex in Java:
private static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");
However, in addition to whitespace ("\\s*"), I also want to separate words on commas, periods (.), tabs (\t), parentheses (()), brackets ([]), and curly braces ({}). In other words, I define a word as a string of one or more alphanumeric characters bounded by two non-alphanumeric characters. For example:
()
{]
<space>
and )
So how should my regex be written to meet this requirement?
If you define a word as one or more consecutive alphanumeric characters, then split on one or more consecutive non-alphanumeric characters, i.e. "\\P{Alnum}+" or "[^a-zA-Z0-9]+".
See regex101 for example.
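As a rough sketch of how that split behaves in Java (the class name `WordSplitDemo` and the helper `tokenize` are made up for illustration, they are not part of the tutorial): `Pattern.split` returns a leading empty string when the line starts with a delimiter such as `(`, so empty tokens should be skipped before counting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class WordSplitDemo {
    // One or more consecutive non-alphanumeric (ASCII) characters form a delimiter.
    private static final Pattern WORD_BOUNDARY = Pattern.compile("\\P{Alnum}+");

    // Hypothetical helper: returns the non-empty words found in one line of input.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String word : WORD_BOUNDARY.split(line)) {
            // split() yields "" first when the line begins with a delimiter.
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Mixed delimiters: parentheses, braces, brackets, comma, period, tab.
        System.out.println(tokenize("(foo) {bar], baz.\tqux")); // [foo, bar, baz, qux]
    }
}
```

In a mapper, the same loop would emit each non-empty word instead of collecting it into a list.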
You can prefix the first one with (?U), i.e. "(?U)\\P{Alnum}+", for full international Unicode support.
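To illustrate the difference the flag makes, here is a small comparison (the class `UnicodeSplitDemo` and its two helper methods are hypothetical names for this example). Without `(?U)`, `\p{Alnum}` only covers ASCII letters and digits, so accented letters are treated as delimiters; with it, they stay inside the word.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class UnicodeSplitDemo {
    // ASCII-only: any character outside [a-zA-Z0-9] is a delimiter.
    private static final Pattern ASCII_ONLY = Pattern.compile("\\P{Alnum}+");
    // (?U) turns on UNICODE_CHARACTER_CLASS, so non-ASCII letters count as alphanumeric.
    private static final Pattern UNICODE_AWARE = Pattern.compile("(?U)\\P{Alnum}+");

    static String[] splitAscii(String s) {
        return ASCII_ONLY.split(s);
    }

    static String[] splitUnicode(String s) {
        return UNICODE_AWARE.split(s);
    }

    public static void main(String[] args) {
        // "naïve café", written with escapes so the source encoding does not matter.
        String text = "na\u00efve caf\u00e9";
        // ASCII-only split breaks the words at the accented letters: na, ve, caf
        System.out.println(Arrays.toString(splitAscii(text)));
        // Unicode-aware split keeps both words whole
        System.out.println(Arrays.toString(splitUnicode(text)));
    }
}
```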