
Regex split extended CSV notation


I have a custom transport format that packages data as follows:

[a:000,"name","field","field","field"]

I'm trying to split each line into the first character after the left bracket plus all the CSV values: a, 000, "name", "field", "field", and so on.

I cobbled together

[^?,:\[\]]

This splits out all the individual characters rather than the colon/comma-delimited fields. I understand it also won't accommodate commas within quotes, so it's clearly rubbish!
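
For illustration, here is a minimal sketch of why the class above returns single characters. Python is an assumption (the question doesn't name a language), and `line`, `single` and `runs` are names made up for the example. A character class with no quantifier consumes exactly one character per match; adding `+` makes each match a whole run between delimiters.

```python
import re

line = '[a:000,"name","field","field","field"]'

# The character class as posted: with no quantifier it consumes exactly one
# character per match, so every non-delimiter character comes back on its own.
single = re.findall(r'[^?,:\[\]]', line)

# Same class with '+': each match is now a run of non-delimiter characters,
# i.e. one colon/comma-delimited field. Quotes are kept as part of the field,
# and a comma inside quotes would still split that field in two.
runs = re.findall(r'[^?,:\[\]]+', line)

print(single[:5])  # ['a', '0', '0', '0', '"']
print(runs)        # ['a', '000', '"name"', '"field"', '"field"', '"field"']
```

The quantified version still splits on commas inside quotes, which is the limitation already noted above.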

Embedded commas aren't really a huge issue: we control the data at both ends, so I could just escape them.

Thanks for any insight!


Solution

  • Instead of trying to split on several characters while ignoring some of them, match the pieces you actually want. Since you didn't specify the implementation language, I'm posting this for Perl, but it should carry over to any flavor that supports lookbehind and lookahead.

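    my $subject = '[a:000,"name","field","field","field"]';
    while ($subject =~ m/(\w+(?=:)|(?<=:)\d+|(?<=,")[^"]*?(?="))/g) {
        print "$1\n";    # $1 (identical to $& here) is the field just matched
    }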
    

    Explanation:

    # (\w+(?=:)|(?<=:)\d+|(?<=,")[^"]*?(?="))
    #
    # Match the regular expression below and capture its match into backreference number 1 «(\w+(?=:)|(?<=:)\d+|(?<=,")[^"]*?(?="))»
    #    Match either the regular expression below (attempting the next alternative only if this one fails) «\w+(?=:)»
    #       Match a single character that is a “word character” (letters, digits, and underscores) «\w+»
    #          Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
    #       Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=:)»
    #          Match the character “:” literally «:»
    #    Or match regular expression number 2 below (attempting the next alternative only if this one fails) «(?<=:)\d+»
    #       Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=:)»
    #          Match the character “:” literally «:»
    #       Match a single digit 0..9 «\d+»
    #          Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
    #    Or match regular expression number 3 below (the entire group fails if this one fails to match) «(?<=,")[^"]*?(?=")»
    #       Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=,")»
    #          Match the characters “,"” literally «,"»
    #       Match any character that is NOT a “"” «[^"]*?»
    #          Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
    #       Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=")»
    #          Match the character “"” literally «"»
    

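
Since the pattern is meant to carry over to other flavors, here is a hedged sketch of the same expression in Python, laid out in verbose mode so each alternative sits next to its comment. The names `FIELD`, `parse_record` and `records`, and the second sample record, are invented for this example and are not part of the original format description.

```python
import re

# Same three alternatives as the Perl pattern, written with re.VERBOSE so each
# branch can carry its own comment. Python's re requires fixed-width
# lookbehinds, which both lookbehinds used here are.
FIELD = re.compile(r'''
    ( \w+ (?=:)                # tag: word characters directly before a colon
    | (?<=:) \d+               # number: digits directly after a colon
    | (?<=,") [^"]*? (?=")     # quoted field: text between ," and the next "
    )
''', re.VERBOSE)


def parse_record(line):
    """Return every field the pattern finds in one bracketed record."""
    # With a single capturing group, findall returns the captured strings.
    return FIELD.findall(line)


records = [
    '[a:000,"name","field","field","field"]',
    '[b:042,"x,y",""]',  # hypothetical record: embedded comma, empty field
]
for record in records:
    print(parse_record(record))
# ['a', '000', 'name', 'field', 'field', 'field']
# ['b', '042', 'x,y', '']
```

Because the quoted branch keys on the surrounding quotes rather than on commas, a comma inside a quoted field survives intact. One limitation of this sketch: a colon inside a quoted field is picked up by the first alternative, so `"key:value"` yields only `key`. If such values can occur, the tag and number would need to be anchored to the opening bracket.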