python python-3.x regex python-re python-pattern

replace specific pattern by using regular expression in python

I am trying to sort out specific paragraph by using regular expression in python.

here is an input.txt file.

some random texts (100+ lines)
bbb
...
ttt
some random texts
ccc
...
fff    
paragraph_a A_story(

...
some random texts adfsasdsd

...
)

paragraph_b different_story(
...
some random texts
...
)

expected output is here:

some random texts (100+ lines)
bbb
...
ttt
some random texts
ccc
...
fff    

paragraph_b different_story(
...
some random texts
...
)

What I want to do is to delete all the paragraph_a contents (including parenthesis) but It should be deleted by the name of the below-line paragraph(in this case, paragraph_b) because the contents of the to-be-deleted paragraph(in this case, paragraph_a) is random.

I've managed to make regular expression to select Only the paragraph that is located right above paragraph_b

https://regex101.com/r/pwGVbe/1 <- you can refer to it in here.

However, By using this regular expression I couldn't delete the thing I want.

here is what I've done so far:

import re

output = open ('output.txt', 'w')
input = open('input.txt', 'r')

for line in input:
#    print(line)
    t = re.sub('^(\w+ \w+\((?:(.|\n)*)\))\s*^paragraph_b','', line)
    output.write(t)

Is there anything I can get some solution or clue? Any answer or advice would be appreciated.

Thanks.

Solution

You can match the paragraph before by asserting paragraph_b and not cross more paragraphs.

Note that input is a reserved keyword, so instead of writing input = open('input.txt', 'r') you might write it like this input_file = open('file', 'r')

 ^\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)

Regex demo

If the match also should not start with paragraph_b itself:

^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)

Regex demo

Example, using input_file.read() to read the whole file:

import re

output_file = open('file_out', 'w')
input_file = open('file', 'r')

t = re.sub(
    '^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)',
    '',
    input_file.read(),
    0,
    re.M
)
output_file.write(t)

Contents of output.txt

some random texts (100+ lines)
bbb
...
ttt
some random texts
ccc
...
fff    


paragraph_b different_story(
...
some random texts
...
)