python-3.x, split, python-re

Regex in Python: can you split a string by a delimiter with exceptions?


I am trying to parse a long string of 'objects' enclosed in quotes and delimited by commas. Example:

s='"12345","X","description of x","X,Y",,,"345355"'

output=['"12345"','"X"','"description of x"','"X,Y"','','','"345355"']

I am using split to break the string on commas:


s = '"12345","X","description of x","X,Y",,,"345355"'
s.split(',')

This almost works, but for the segment ...,"X,Y",... the comma inside the quotes is also treated as a separator, so the field is split into "X and Y". I need the split to ignore commas inside quotes.
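For reference, a minimal sketch that reproduces the behaviour described above. The variable name `parts` is only illustrative and does not appear in the original attempt.

```python
s = '"12345","X","description of x","X,Y",,,"345355"'

# A plain split treats every comma as a separator,
# including the one inside the quoted field "X,Y".
parts = s.split(',')
print(parts)
# ['"12345"', '"X"', '"description of x"', '"X', 'Y"', '', '', '"345355"']
```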


Is there a way I can split on commas except for those inside quotes?

I tried using a regex, but it ignores the ...,,,... in the data because blank fields in the file I'm parsing have no quotes. I am not an expert with regex, and the sample I used came from "Python split string on quotes". I do understand what this example is doing, but I am not sure how to modify it to parse data that is not enclosed in quotes.

Thanks!



Solution

  • This should work:

    In [1]: import re
    
    In [2]: s = '"12345","X","description of x","X,Y",,,"345355"'
    
    In [3]: pattern = r"(?<=[\",]),(?=[\",])"
    
    In [4]: re.split(pattern, s)
    Out[4]: ['"12345"', '"X"', '"description of x"', '"X,Y"', '', '', '"345355"']
    

    Explanation:

    • (?<=...) is a "positive lookbehind assertion". It causes your pattern (in this case, just a comma, ",") to match commas in the string only if they are preceded by the pattern given by .... Here, ... is [\",], which means "either a quotation mark or a comma".
    • (?=...) is a "positive lookahead assertion". It causes your pattern to match commas in the string only if they are followed by the pattern specified as ... (again, [\",]: either a quotation mark or a comma).
    • Since both of these assertions must be satisfied for the pattern to match, it will still work correctly if any of your 'objects' begin or end with commas as well.
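The last point can be checked with a small sketch. The helper name `split_fields` and the test strings are hypothetical, chosen only for illustration; the pattern itself is the one from the answer, written with a single-quoted raw string so the quotation mark needs no escaping.

```python
import re

# Same pattern as in the answer: a comma is a separator only when the
# character before it and the character after it are each a quote or a comma.
PATTERN = re.compile(r'(?<=[",]),(?=[",])')


def split_fields(line):
    """Hypothetical helper: split one line on commas outside quoted fields."""
    return PATTERN.split(line)


# Fields that begin or end with a comma stay intact.
print(split_fields('"a",",X","Y,","b"'))
# ['"a"', '",X"', '"Y,"', '"b"']

# A quoted field consisting of only a comma is still split, because that
# comma also sits between two quotation marks.
print(split_fields('"a",",","b"'))
# ['"a"', '"', '"', '"b"']
```

So the pattern appears to hold for data shaped like the example, but it is not a general CSV parser. If the surrounding quotes do not need to be preserved, the standard-library `csv` module may be a more robust option.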