Tags: python, python-3.x, regex, python-re, python-pattern

Replace a specific pattern using a regular expression in Python


I am trying to remove a specific paragraph from a file using a regular expression in Python.

Here is the input.txt file:

some random texts (100+ lines)
bbb
...
ttt
some random texts
ccc
...
fff    
paragraph_a A_story(

...
some random texts adfsasdsd

...
)

paragraph_b different_story(
...
some random texts
...
)

The expected output is:

some random texts (100+ lines)
bbb
...
ttt
some random texts
ccc
...
fff    

paragraph_b different_story(
...
some random texts
...
)

I want to delete the entire paragraph_a block (including its parentheses). However, the block should be identified by the name of the paragraph below it (in this case, paragraph_b), because the contents of the paragraph to be deleted (in this case, paragraph_a) are random.

I've managed to write a regular expression that selects only the paragraph located directly above paragraph_b:

https://regex101.com/r/pwGVbe/1

However, I couldn't delete that paragraph using this regular expression.

Here is what I've done so far:

import re

output = open ('output.txt', 'w')
input = open('input.txt', 'r')

for line in input:
#    print(line)
    t = re.sub('^(\w+ \w+\((?:(.|\n)*)\))\s*^paragraph_b','', line)
    output.write(t)

Is there a solution or a clue I could follow? Any answer or advice would be appreciated.

Thanks.


Solution

  • You can match the paragraph above by asserting that paragraph_b follows it, without letting the match cross into other paragraphs.

    Note that input is the name of a built-in function (not a reserved keyword), so assigning to it shadows the built-in. Instead of writing input = open('input.txt', 'r'), you might write input_file = open('file', 'r').

     ^\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)
    


    If the match also should not start with paragraph_b itself:

    ^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)
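
To see what the extra (?!paragraph_b) lookahead changes, here is a small sketch on a made-up string in which two paragraph_b blocks follow each other. The sample text and the variable names are assumptions for illustration and do not come from the question.

```python
import re

# Hypothetical sample: two consecutive paragraph_b blocks.
text = (
    "paragraph_b first(\n"
    "...\n"
    ")\n"
    "\n"
    "paragraph_b second(\n"
    "...\n"
    ")\n"
)

# Without the extra lookahead: any header-style paragraph followed by paragraph_b.
plain = r'^\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)'
# With the extra lookahead: the match may not start on a paragraph_b header.
guarded = r'^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)'

removed_plain = re.sub(plain, '', text, flags=re.M)
removed_guarded = re.sub(guarded, '', text, flags=re.M)

print(removed_plain == text)    # False: the first paragraph_b block was removed
print(removed_guarded == text)  # True: nothing matched, the text is unchanged
```

With the first pattern, a paragraph_b block that sits directly above another paragraph_b block is itself removed; the second pattern leaves it alone.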
    


    Example, using input_file.read() to read the whole file:

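    import re

    # Match a "name name( ... )" paragraph that is directly followed by paragraph_b.
    # A raw string keeps the backslashes in \w and \( intact.
    pattern = r'^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)'

    # Read the whole file at once so the pattern can span several lines.
    with open('input.txt', 'r') as input_file:
        text = input_file.read()

    # re.M makes ^ match at the start of every line.
    t = re.sub(pattern, '', text, flags=re.M)

    with open('output.txt', 'w') as output_file:
        output_file.write(t)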
    

    Contents of output.txt

    some random texts (100+ lines)
    bbb
    ...
    ttt
    some random texts
    ccc
    ...
    fff    
    
    
    paragraph_b different_story(
    ...
    some random texts
    ...
    )
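
The loop in the question most likely leaves the file unchanged because a multi-line pattern cannot match when re.sub only ever sees one line at a time. The sketch below illustrates this with the pattern from this answer and an in-memory string standing in for input.txt. The shortened sample text and the helper name remove_before_paragraph_b are assumptions for illustration.

```python
import re

PATTERN = re.compile(
    r'^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)',
    re.M,
)


def remove_before_paragraph_b(text):
    """Remove the paragraph directly above paragraph_b (hypothetical helper)."""
    return PATTERN.sub('', text)


# Stand-in for the contents of input.txt.
sample = (
    "fff\n"
    "paragraph_a A_story(\n"
    "\n"
    "some random texts adfsasdsd\n"
    ")\n"
    "\n"
    "paragraph_b different_story(\n"
    "some random texts\n"
    ")\n"
)

# Line by line: no single line contains the whole paragraph, so nothing matches.
line_by_line = ''.join(
    remove_before_paragraph_b(line) for line in sample.splitlines(keepends=True)
)

# Whole text: the pattern sees paragraph_a and the paragraph_b header together.
whole_text = remove_before_paragraph_b(sample)

print(line_by_line == sample)       # True
print('paragraph_a' in whole_text)  # False
print('paragraph_b' in whole_text)  # True
```

Reading the file with read() and substituting once over the full string is what allows the lookahead for paragraph_b to succeed.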