java, regex, csv, capturing-group

Java Regex, capturing groups with comma separated values


InputString: A soldier may have bruises , wounds , marks , dislocations or other Injuries that hurt him .

ExpectedOutput:
bruises
wounds
marks
dislocations
Injuries

Generalized Pattern Tried:

       ".[\s]?(\w+?)"+                 // bruises.
      "(?:(\s)?,(\s)?(\w+?))*"+             // wounds marks dislocations
      "[\s]?(?:or|and) other (\w+).";     // Injuries

The pattern should also be able to match other input strings, such as: A soldier may have bruises or other injuries that hurt him.

On trying the generalized pattern above, the output is: bruises dislocations Injuries

There is something wrong with the capturing group in "(?:(\s)?,(\s)?(\w+?))*". The group matches one or more occurrences, but it returns only "dislocations"; "wounds" and "marks" are swallowed.

Could you please suggest what the right pattern should be, and where the mistake is? A related question comes closest to mine, but its solution didn't help.

Thanks.


Solution

  • When a capture group is followed by a quantifier, e.g. (foo)*, you only get the match from its last iteration. If you want all of them, you need to put the quantifier inside the capture group and then parse the values out of the captured text manually. As big a fan as I am of regex, I don't think it's appropriate here for any number of reasons, even if you weren't ultimately doing NLP.

    How to fix: (?:(\s)?,(\s)?(\w+?))*

    Well, the quantifier basically covers the whole regex in that case, so you might as well use Matcher.find() to step through each match. Also, I'm curious why you have capture groups for the whitespace. If all you are trying to do is find a comma-separated set of words, then that's something like: \w+(?:\s*,\s*\w+)* and you don't need to bother with per-word capture groups; just split the whole match on the commas.

    And for anything more complicated re: NLP, GATE is a pretty powerful tool. The learning curve is steep at times but you have a whole industry of science-guys to draw from: http://gate.ac.uk/
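
To make the regex part of the advice concrete, here is a minimal sketch, not a definitive implementation. It assumes the list always ends with "or other X" or "and other X", as in the question's two sample sentences. The class name CommaListExtractor, the method names, and the exact combined pattern are illustrative choices, not something given in the question. The first method shows a repeated group keeping only its last iteration; the second matches the whole list as one group and splits it afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
public class CommaListExtractor {

    // A repeated capture group: group 2 is overwritten on every iteration,
    // so only the value from the final iteration survives.
    private static final Pattern REPEATED =
            Pattern.compile("(\\w+)(?:\\s*,\\s*(\\w+))*");

    // Assumed sentence shape: "<w1> , <w2> , ... or|and other <last>".
    // Group 1 is the whole comma-separated list, group 2 the final word.
    private static final Pattern LIST =
            Pattern.compile("(\\w+(?:\\s*,\\s*\\w+)*)\\s+(?:or|and)\\s+other\\s+(\\w+)");

    // Returns what the repeated group holds after a full match (null if it never ran).
    static String lastRepeatedCapture(String list) {
        Matcher m = REPEATED.matcher(list);
        return m.matches() ? m.group(2) : null;
    }

    // Captures the list as a single group, then splits it on the commas.
    static List<String> extractItems(String sentence) {
        List<String> items = new ArrayList<>();
        Matcher m = LIST.matcher(sentence);
        if (m.find()) {
            items.addAll(Arrays.asList(m.group(1).split("\\s*,\\s*")));
            items.add(m.group(2));
        }
        return items;
    }

    public static void main(String[] args) {
        // Prints: dislocations
        System.out.println(lastRepeatedCapture("bruises , wounds , marks , dislocations"));

        // Prints: [bruises, wounds, marks, dislocations, Injuries]
        System.out.println(extractItems(
                "A soldier may have bruises , wounds , marks , dislocations or other Injuries that hurt him ."));

        // Prints: [bruises, injuries]
        System.out.println(extractItems(
                "A soldier may have bruises or other injuries that hurt him."));
    }
}
```

If a sentence can contain several such lists, the if (m.find()) could become a while loop so that each match is stepped through in turn.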