python regex line newline

How to completely remove a whole line with multiline regular expressions?


I want to remove all lines that include a b in this multiline string:

aba\n
aaa\n
aba\n
aaa\n
aba[\n\n - optional]

Note the file is not necessarily terminated by a newline character, or may have extra line breaks at the end that I want to keep.

This is the expected output:

aaa\n
aaa[\n\n - as in the input file]

This is what I have tried:

import re
String = "aba\naaa\naba\naaa\naba"
print(String)
print(re.sub(".*b.*", "", String))  # this one leaves three empty lines
print(re.sub(".*b.*\n", "", String))  # this one misses the last line
print(re.sub("\n.*b.*", "", String))  # this one misses the first line
print(re.sub(".*b.*\n?", "", String))  # this one leaves an empty last line
print(re.sub("\n?.*b.*", "", String))  # this one leaves an empty first line
print(re.sub("\n?.*b.*\n?", "", String))  # this one joins the two remaining lines

I have also tried flags=re.M and various look-aheads and look-behinds, but the main question seems to be: how can I remove either the first or the last occurrence of \n in a matching string, depending on which one exists, but not both if both exist?


Solution

  • There are three cases to take into account in your re.sub() call to remove lines with a b in them:

    1. matching lines followed by an end-of-line character (eol)
    2. the last line in the text (without a trailing eol)
    3. when there is only one line with no trailing eol

    In that second case, you want to remove the preceding eol character to avoid creating an empty line. The third case will produce an empty string if there is a "b".

    Because matches cannot overlap, there is a fourth case to handle. If the last line contains a "b" and the line before it also contains a "b", case #1 will already have consumed the eol character that ends the previous line, so case #2 (eol followed by the pattern at the end of the text) can no longer match the last line. This can be addressed by removing consecutive matching lines as a group (case #1) and including the last line as an optional component of that group. Whatever this leaves out will be trailing lines (case #2), where you want to remove the preceding eol rather than the following one.
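
To make the overlap problem concrete, here is a minimal sketch. The `naive` pattern and the sample text are hypothetical illustrations, not part of the solution: the pattern simply lists the three cases as alternatives, without grouping consecutive lines.

```python
import re

# Hypothetical naive pattern: case 1 | case 2 | case 3, with no grouping
naive = "(.*b.*)\n|\n(.*b.*)$|^(.*b.*)$"

text = "aaa\naba\naba"  # the last two lines both contain a "b"
result = re.sub(naive, "", text)

# Case 1 removes the second line together with its eol, so the final
# "aba" has no preceding eol left for case 2 to match and it survives.
print(repr(result))  # 'aaa\naba'
```
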

    To manage repetition of the line pattern .*b.*, assemble the search pattern from two parts: the line pattern, and the list pattern that uses it multiple times. Since we're already deep in regular expressions, why not use re.sub() to do that as well?

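    import re

    # Pattern for a single line to remove (any line containing a "b")
    LinePattern = "(.*b.*)"
    # Template in which "Line" is a placeholder for LinePattern
    ListPattern = "(Line\n)+(Line$)?|(\nLine$)|(^Line$)" # Case1|Case2|Case3
    # Substitute the placeholder to build the final search pattern
    Pattern     = re.sub("Line",LinePattern,ListPattern)

    String  = "aba\naaa\naba\naaa\naba"
    cleaned = re.sub(Pattern,"",String)
    print(repr(cleaned))  # 'aaa\naaa'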
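
As a quick check, the sketch below runs the assembled pattern over a few inputs, including the optional trailing line breaks from the question. The `samples` list and the lowercase variable names are only illustrative, and str.replace() is swapped in for re.sub() when assembling the pattern (the result is the same for this line pattern).

```python
import re

line_pattern = "(.*b.*)"
list_pattern = "(Line\n)+(Line$)?|(\nLine$)|(^Line$)"
# Plain text substitution of the "Line" placeholder
pattern = list_pattern.replace("Line", line_pattern)

samples = [
    "aba\naaa\naba\naaa\naba",      # no trailing newline
    "aba\naaa\naba\naaa\naba\n\n",  # extra line breaks at the end
    "aba",                          # a single matching line
    "aaa\naba\naba",                # consecutive matching lines at the end
]
results = [re.sub(pattern, "", s) for s in samples]
for s, r in zip(samples, results):
    print(repr(s), "->", repr(r))
```

The four results are 'aaa\naaa', 'aaa\naaa\n\n', '' and 'aaa\n'. In the last sample the grouped match starts after the eol that ends "aaa", so that eol is kept; whether this is acceptable depends on your requirements. One reason to prefer str.replace() for the assembly step is that re.sub() gives backslashes in the replacement string a special meaning, which matters if the line pattern contains sequences such as \w.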
    

    Note: This technique also works with a different separator (e.g. a comma instead of eol), but the separator must replace \n in the list pattern and must be excluded from the line pattern, e.g. ([^,]*b[^,]*).
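
The note about other separators could be sketched as follows. The helper `build_pattern` is a hypothetical name introduced only for illustration; it fills in the same three-case list pattern for an arbitrary separator, and it assumes the caller supplies a line pattern that cannot match the separator.

```python
import re

def build_pattern(line_pattern, sep):
    """Hypothetical helper: build Case1|Case2|Case3 for a given separator."""
    s = re.escape(sep)  # treat the separator as literal text
    l = line_pattern
    return f"({l}{s})+({l}$)?|({s}{l}$)|(^{l}$)"

# The line pattern excludes the separator, as the note requires
pattern = build_pattern("([^,]*b[^,]*)", ",")

cleaned = re.sub(pattern, "", "aba,aaa,aba,aaa,aba")
print(cleaned)  # aaa,aaa
```
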