I have a CSV file that uses pipes as delimiters. Sometimes the 3rd pipe is followed by a short substring of at most 2 alphanumeric characters. In that case the 3rd pipe should not be interpreted as a delimiter.
Example: split on every pipe:
x1 = "as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY"
=> split after XXL because it is followed by more than 2 characters
Examples: split on all pipes except the 3rd, because there are fewer than 3 characters between pipes 3 and 4:
x2 = "as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf"
x3 = "as234-H3wer23-dZ|df3r Xa12yu wg|a1|2|sweDSgIKU|2017-SS0|123.45|YTf"
=> keep "1g|z4" and "a1|2" together.
My regex attempts so far only manage a substring replacement like this one, which replaces the first pipe found between two digits with a hyphen (3|4 => 3-4):
x = re.sub(r'(?<=\d)\|(?=\d)', repl='-', string=x1, count=1)
My question: how can I make re.split ignore the 3rd pipe and continue with the 4th when the 3rd pipe is followed by a short alphanumeric substring of only 1 or 2 characters (like Bx, 2, 42, z or 3b)? All pipes other than #3 are unconditional delimiters.
You can use re.sub to put a quote character around the short columns, then use Python's built-in csv module to parse the text (regex101 of the used expression):
import re
import csv
from io import StringIO
txt = """\
as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY
as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf
as234-H3wer23-dZ|df3r Xa12yu wg|a1|2|sweDSgIKU|2017-SS0|123.45|YTf"""
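# \1 = the first two fields with their pipes; \2 = field 3, the 3rd pipe, and
# at most 2 non-pipe characters (possibly none) that are followed by another pipe.
pat = re.compile(r"^((?:[^|]+\|){2})([^|]+\|[^|]{,2}(?=\|))", flags=re.M)
# Quote group 2 so the csv reader treats the pipe inside it as literal text.
txt = pat.sub(r'\1"\2"', txt)
reader = csv.reader(StringIO(txt), delimiter="|", quotechar='"')
for line in reader:
    print(line)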
Prints:
['as234-HJ123-HG', 'dfdf KHT werg', 'XXL', 's45dtgIKU', '2017-SS0', '123.45', 'asUJY']
['as234-H344423-dfX', 'dfer XXYUyu werg', '1g|z4', 'sweDSgIKU', '2017-SS0', '123.45', 'YTf']
['as234-H3wer23-dZ', 'df3r Xa12yu wg', 'a1|2', 'sweDSgIKU', '2017-SS0', '123.45', 'YTf']
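If you would rather not rewrite the text before parsing, another option is to split on every pipe first and then glue fields 3 and 4 back together when field 4 is short. This is only a minimal sketch, not part of the approach above: `split_row` and `SHORT` are hypothetical names, and it assumes "short" means exactly 1 or 2 ASCII alphanumeric characters, as in the question's examples (the pattern above also accepts an empty field and any non-pipe characters).

```python
import re

# Assumption: a "short" field is 1 or 2 ASCII alphanumeric characters.
SHORT = re.compile(r"[A-Za-z0-9]{1,2}")


def split_row(line):
    """Split on every pipe, then re-join fields 3 and 4 if field 4 is short."""
    parts = line.split("|")
    # parts[2] is the field before the 3rd pipe, parts[3] the field after it.
    if len(parts) > 3 and SHORT.fullmatch(parts[3]):
        parts[2:4] = [parts[2] + "|" + parts[3]]
    return parts


rows = [
    "as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY",
    "as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf",
    "as234-H3wer23-dZ|df3r Xa12yu wg|a1|2|sweDSgIKU|2017-SS0|123.45|YTf",
]
for row in rows:
    print(split_row(row))
```

For the three sample rows this gives the same lists as the csv version. It does no quote handling at all, so it only fits data where the fields never contain quoted pipes.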