python regex split substring

regex split: ignore delimiter if followed by short substring


I have a CSV file in which pipes serve as delimiters. Sometimes, however, the 3rd pipe is followed by a short substring of at most 2 alphanumeric characters. In that case the 3rd pipe should not be interpreted as a delimiter.

Example: split on every pipe:

x1 = "as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY"

=> split after XXL because it is followed by more than 2 characters

Examples: split on all pipes except the 3rd, because there are fewer than 3 characters between pipes 3 and 4:

x2 = "as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf"

x3 = "as234-H3wer23-dZ|df3r Xa12yu wg|a1|2|sweDSgIKU|2017-SS0|123.45|YTf"

=> keep "1g|z4" and "a1|2" together.
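To make the difference concrete, here is a minimal sketch of what a plain split on every pipe does with the three sample rows. It only illustrates the problem: x1 already comes out with the wanted 7 fields, while x2 and x3 come out with 8 because the short substring is cut off as its own field.

```python
x1 = "as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY"
x2 = "as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf"
x3 = "as234-H3wer23-dZ|df3r Xa12yu wg|a1|2|sweDSgIKU|2017-SS0|123.45|YTf"

# Naive approach: every pipe is treated as a delimiter.
naive = [row.split("|") for row in (x1, x2, x3)]

for fields in naive:
    # x1 has 7 fields; x2 and x3 have 8 because "z4" and "2" are split off.
    print(len(fields), fields[2:4])
```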

My regex attempts only go as far as a substring replacement like this one, which replaces a pipe with a hyphen when it sits between 2 digits (3|4 => 3-4):

x = re.sub(r'(?<=\d)\|(?=\d)', repl='-', string=x1, count=1)

My question: if the 3rd pipe is followed by a short alphanumeric substring of only 1 or 2 characters (like Bx, 2, 42, z or 3b), then re.split should ignore the 3rd pipe and continue with the 4th pipe. All pipes other than #3 are unconditional delimiters.


Solution

  • You can use re.sub to wrap the 3rd column and the short column that follows it in a quote character, then parse the text with Python's built-in csv module:

    import re
    import csv
    from io import StringIO
    
    txt = """\
    as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY
    as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf
    as234-H3wer23-dZ|df3r Xa12yu wg|a1|2|sweDSgIKU|2017-SS0|123.45|YTf"""
    
    
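    # Group 1: the first two fields, each with its trailing pipe.
    # Group 2: the 3rd field, the 3rd pipe, and at most 2 non-pipe characters,
    # matched only when the 4th pipe follows immediately (lookahead).
    pat = re.compile(r"^((?:[^|]+\|){2})([^|]+\|[^|]{0,2}(?=\|))", flags=re.M)
    # Quote group 2 so the csv module treats the pipe inside it as data.
    txt = pat.sub(r'\1"\2"', txt)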
    
    reader = csv.reader(StringIO(txt), delimiter="|", quotechar='"')
    for line in reader:
        print(line)
    

    Prints:

    ['as234-HJ123-HG', 'dfdf KHT werg', 'XXL', 's45dtgIKU', '2017-SS0', '123.45', 'asUJY']
    ['as234-H344423-dfX', 'dfer XXYUyu werg', '1g|z4', 'sweDSgIKU', '2017-SS0', '123.45', 'YTf']
    ['as234-H3wer23-dZ', 'df3r Xa12yu wg', 'a1|2', 'sweDSgIKU', '2017-SS0', '123.45', 'YTf']
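If you would rather not go through the csv module, the same result can be sketched with a plain split followed by a merge step. This is an alternative under stated assumptions, not the approach above: the names split_row and SHORT_FIELD are made up for illustration, and the check is stricter than the pattern above because it requires 1 or 2 ASCII alphanumeric characters (as described in the question) instead of up to 2 arbitrary non-pipe characters.

```python
import re

# Hypothetical helper pattern: a field of exactly 1 or 2 ASCII alphanumerics.
SHORT_FIELD = re.compile(r"[A-Za-z0-9]{1,2}")


def split_row(line):
    """Split on pipes, but glue field 4 back onto field 3 when it is short."""
    parts = line.split("|")
    # len(parts) > 4 ensures a 4th pipe exists after the short field.
    if len(parts) > 4 and SHORT_FIELD.fullmatch(parts[3]):
        parts[2:4] = [parts[2] + "|" + parts[3]]
    return parts


rows = [
    "as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY",
    "as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf",
    "as234-H3wer23-dZ|df3r Xa12yu wg|a1|2|sweDSgIKU|2017-SS0|123.45|YTf",
]
result = [split_row(row) for row in rows]
for fields in result:
    print(fields)
```

For the three sample rows this should give the same 7-field lists as the csv-based version. The merge step avoids injecting quote characters into the data, which matters if a field could itself contain a double quote.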