
Regex for complex delimited string with multiple parse patterns


I have the following string:

def str='prop1: value1, prop2: value2;value3, prop3:"test:1234, test1:23;45, test4:34;34", prop4: "test, 66;77, 888"'

What I want to end up with is the following list of pairs:

prop1: value1
prop2: value2;value3
prop3: test:1234, test1:23;45, test4:34;34
prop4: test, 66;77, 888

I figure that if I can first parse and strip out prop3 and prop4, I can simply split the rest of the string on commas. However, I am having trouble getting a match for prop4.

Below are the code and the regexes I have tried so far. The commented-out lines are earlier attempts; none of them extracts the last property, prop4.

  def str='prop1: value1, prop2: value2;value3, prop3:"test:1234, test1:23;45, test4:34;34", prop4: "test, 66;77, 888"'
  //def regex = /(\w+):"(.*)"[,\s$]/
  //def regex = /(\w+):"(.*)"[,|\s|$]/
  def regex = /(\w+):"(.*)"[,\s]|$/
  def m = (str =~ regex)
  (0..<m.count).each{
    println("${m[it][1]}=${m[it][2]}")
  }

This returns:

prop3=test:1234, test1:23;45, test4:34;34
null=null

What am I missing here?

(Also, is there a way to parse all of this in a single regex pass, instead of my two-step approach of matching with a regex first and then splitting?)


Solution

  • Based on your given example data, the following regex would work:

    \b(\w+):\s*(\"[^\"]*\"|[^,\"]*)
    

    RegEx Details:

    • \b: Word boundary
    • (\w+): Capture group #1 to match 1+ word characters
    • :: Match a :
    • \s*: 0 or more whitespaces
    • (: Start capture group #2
      • \"[^\"]*\": Match a quoted text
      • |: OR
      • [^,\"]*: Match 0 or more of any character that is neither , nor "
    • ): End capture group #2
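
As a minimal sketch of how the regex above could be applied in Groovy in a single pass, the snippet below collects each name/value pair into a map and strips the surrounding quotes from quoted values. The variable names (`pairs`, `value`) are illustrative, and the input is assumed to be the string used in the question's code; this is one possible way to wire it up, not the only one.

```groovy
// Input string, assumed to be the one used in the question's code
def str = 'prop1: value1, prop2: value2;value3, prop3:"test:1234, test1:23;45, test4:34;34", prop4: "test, 66;77, 888"'

// Same pattern as above; quotes need no escaping inside a Groovy slashy string
def regex = /\b(\w+):\s*("[^"]*"|[^,"]*)/

// Map literals in Groovy are LinkedHashMaps, so insertion order is preserved
def pairs = [:]

def m = (str =~ regex)
while (m.find()) {
    def value = m.group(2).trim()
    // Drop the surrounding double quotes when the value was quoted
    if (value.length() >= 2 && value.startsWith('"') && value.endsWith('"')) {
        value = value.substring(1, value.length() - 1)
    }
    pairs[m.group(1)] = value
}

pairs.each { name, value -> println "${name}: ${value}" }
```

As for the original attempt: in `/(\w+):"(.*)"[,\s]|$/` the `|` has the lowest precedence, so the pattern means "the whole left-hand expression OR end of input". The empty match at the end of the string has no captured groups, which is where `null=null` comes from. In addition, `(\w+):"` does not allow the space between `prop4:` and its opening quote, so prop4 can never match the left-hand side.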