Search code examples
regexcsvasp-classicsplitquotes

Split a CSV where some entries have double quotes


I'm trying to split a CSV file into an array, but encountering two problems: quoted commas and empty elements.

I tried numerous possibilities. I've not found a a regex that works 100%.

The CSV looks like:

123,2.99,AMO024,Title,"Description, more info",,123987564

The regex I tried:

thisLine.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)

In my output array the fifth element comes out as 123987564 not an empty string.


Solution

  • Description

    Instead of using a split, I think it would be easier to simply execute a match and process all the found matches.

    This expression will:

    • divide your sample text on the comma delimits
    • will process empty values
    • will ignore double quoted commas, providing double quotes are not nested
    • trims the delimiting comma from the returned value
    • trims surrounding quotes from the returned value
    • if the string starts with a comma, then the first capture group will return a null value

    Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)[^"]*|[^,"]*))"?(?=,|$)

    enter image description here

    Example

    Sample Text

    123,2.99,AMO024,Title,"Description, more info",,123987564
    

    ASP example using the non-java expression

    Set regEx = New RegExp
    regEx.Global = True
    regEx.IgnoreCase = True
    regEx.MultiLine = True
    sourcestring = "your source string"
    regEx.Pattern = "(?:^|,)(?=[^""]|("")?)""?((?(1)[^""]*|[^,""]*))""?(?=,|$)"
    Set Matches = regEx.Execute(sourcestring)
      For z = 0 to Matches.Count-1
        results = results & "Matches(" & z & ") = " & chr(34) & Server.HTMLEncode(Matches(z)) & chr(34) & chr(13)
        For zz = 0 to Matches(z).SubMatches.Count-1
          results = results & "Matches(" & z & ").SubMatches(" & zz & ") = " & chr(34) & Server.HTMLEncode(Matches(z).SubMatches(zz)) & chr(34) & chr(13)
        next
        results=Left(results,Len(results)-1) & chr(13)
      next
    Response.Write "<pre>" & results
    

    Matches using the non-java expression

    Group 0 gets the entire substring which includes the comma
    Group 1 gets the quote if it's used
    Group 2 gets the value not including the comma

    [0][0] = 123
    [0][1] = 
    [0][2] = 123
    
    [1][0] = ,2.99
    [1][1] = 
    [1][2] = 2.99
    
    [2][0] = ,AMO024
    [2][1] = 
    [2][2] = AMO024
    
    [3][0] = ,Title
    [3][1] = 
    [3][2] = Title
    
    [4][0] = ,"Description, more info"
    [4][1] = "
    [4][2] = Description, more info
    
    [5][0] = ,
    [5][1] = 
    [5][2] = 
    
    [6][0] = ,123987564
    [6][1] = 
    [6][2] = 123987564
    

    Edited

    As Boris pointed out CSV format will escape a double quote " as a double double quote "". Although this requirement wasn't included by the OP, if your text includes double double quotes then you'll want to use a this modified expression:

    Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)(?:[^"]|"")*|[^,"]*))"?(?=,|$)

    See also: https://regex101.com/r/y8Ayag/1

    It should also be pointed out that Regex is a pattern matching tool not a parsing engine. Therefore if your text includes double double quotes it will still contain the double double quotes after pattern matching is completed. With this solution you'd still need to search for the double double quotes and replace them in your captured text.