
How to separate words by comma, space, period (.), tab (\t), parentheses (), brackets [], and curly braces {} in Hadoop word count?


I am practicing MapReduce with the Cloudera tutorial. Currently, the tutorial splits words only on whitespace, using this regex in Java:

private static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");

In addition to whitespace ("\\s*"), I also want words to be separated by commas, periods (.), tabs (\t), parentheses (), brackets [], and curly braces {}. In other words, I define a word as a string of one or more alphanumeric characters bounded on each side by a non-alphanumeric character. For example:

  • (cece54) has one word "cece54" bounded by ()
  • {dwd] has one word "dwd" bounded by {]
  • xxx) has one word "xxx" bounded by <space> and )
  • and so on.

How should my regex be written to meet this requirement?


Solution

  • If you define a word as one or more consecutive alphanumeric characters, then split on one or more consecutive non-alphanumeric characters, i.e. "\\P{Alnum}+" or "[^a-zA-Z0-9]+".

    See regex101 for example.

    You can prefix the first one with (?U), i.e. "(?U)\\P{Alnum}+", for full international Unicode support.
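
    As a rough sketch of how this could look in practice, the snippet below splits a line on "\\P{Alnum}+" and skips empty tokens. The class name WordSplitSketch and the helper words() are hypothetical names made up for illustration; they are not part of the Cloudera tutorial. The empty-token check matters because splitting a line that starts with a delimiter, such as "(cece54)", yields an empty first element.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical standalone class for illustration; not the tutorial's mapper.
final class WordSplitSketch {
    // Split on one or more consecutive non-alphanumeric characters.
    // Use "(?U)\\P{Alnum}+" instead if non-ASCII letters and digits should count as word characters.
    private static final Pattern WORD_BOUNDARY = Pattern.compile("\\P{Alnum}+");

    // Returns the alphanumeric words in the line, in order of appearance.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        for (String word : WORD_BOUNDARY.split(line)) {
            // A delimiter at the start of the line produces an empty leading token.
            if (word.isEmpty()) {
                continue;
            }
            result.add(word);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [cece54, dwd, xxx, a, b, c, d]
        System.out.println(words("(cece54) {dwd] xxx) a,b.c\td"));
    }
}
```

    In a mapper, the same pattern would presumably replace the existing WORD_BOUNDARY constant, with the empty-token check kept inside the loop that emits each word.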