We are trying to process a tsv file. The fools who made it allowed newlines in some columns, which cause issues now. Luckily these column values with the newlines in them are always contained in double quotes so it is possible to replace them (the newline characters) with spaces using regex. We found this post with the perfect solution to do this, as such:
python: re.sub(r'"[^"]*(?:""[^"]*)*"', lambda m: m.group(0).replace("\n", ""), s)
However, in some unfortunate cases, some other fool entered a unpaired double quote (i.e. not closed) into some other data field and this throws the whole thing off.
We thought about it and we believe everything can be solved if we just refine the regex condition so that the 'selected' text (in which new line characters are to be replaced) so that it may not include any tab characters.
In other words: this "[^"]*(?:""[^"]*)*"
needs to exclude substrings that contain a tab character.
How can we do that?
If you do not want to match TAB chars with your regex, you just need to add \t
to the negated character classes:
"[^"\t]*(?:""[^"\t]*)*"
See the regex demo.
Details:
"
- a double quote[^"\t]*
- zero or more chars other than double quotes and tab chars(?:""[^"\t]*)*
- zero or more sequences of
""
- two double quotes[^"\t]*
- zero or more chars other than double quotes and tab chars"
- a double quote.NOTE it might not work as expected by matching between unwanted quotes. Then use your "[^"]*(?:""[^"]*)*"
regex and check for a tab char in the lambda expression, e.g.
re.sub(r'"[^"]*(?:""[^"]*)*"', lambda m: m.group().replace("\n", "") if "\t" not in m.group() else m.group(), s)
Or something similar.