I am trying to remove a specific paragraph from a file using a regular expression in Python.
Here is the input.txt file:
some random texts (100+ lines)
bbb
...
ttt
some random texts
ccc
...
fff
paragraph_a A_story(
...
some random texts adfsasdsd
...
)
paragraph_b different_story(
...
some random texts
...
)
The expected output is:
some random texts (100+ lines)
bbb
...
ttt
some random texts
ccc
...
fff
paragraph_b different_story(
...
some random texts
...
)
What I want to do is delete the whole paragraph_a block (including the parentheses). However, it has to be identified by the name of the paragraph directly below it (in this case, paragraph_b), because the contents of the paragraph to be deleted (in this case, paragraph_a) are random.
I've managed to write a regular expression that selects only the paragraph located right above paragraph_b.
You can see it here: https://regex101.com/r/pwGVbe/1
However, using this regular expression, I couldn't delete the part I want.
Here is what I've done so far:
import re
output = open('output.txt', 'w')
input = open('input.txt', 'r')
for line in input:
    # print(line)
    t = re.sub('^(\w+ \w+\((?:(.|\n)*)\))\s*^paragraph_b','', line)
    output.write(t)
Is there a solution, or a clue that could point me in the right direction? Any answer or advice would be appreciated.
Thanks.
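You can match the paragraph directly above paragraph_b by asserting with a lookahead that paragraph_b follows, while not letting the match cross into another paragraph.
Note that input is the name of a built-in function (not a reserved keyword), and assigning to it shadows that function. So instead of writing input = open('input.txt', 'r') you might write input_file = open('input.txt', 'r')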
^\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)
If the match also should not start with paragraph_b itself:
^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)
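To make the pieces of this pattern easier to follow, here is a sketch of the same expression written with re.VERBOSE, run against a shortened sample. The names PARAGRAPH_BEFORE_B and sample are made up for this illustration, and the sample text is an abbreviated stand-in for input.txt.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each part can be
# commented. A literal space must be written as [ ] in verbose mode.
PARAGRAPH_BEFORE_B = re.compile(
    r"""
    ^(?!paragraph_b)          # start of a line that is not paragraph_b itself
    \w+[ ]\w+\(               # a header line such as: paragraph_a A_story(
    (?:                       # then any number of following lines ...
        \n(?!^\w+[ ]\w+\()    # ... as long as the line is not another header
        .*
    )*
    \)                        # the engine backtracks to the closing parenthesis
    (?=\s*^paragraph_b)       # succeed only if paragraph_b comes right after
    """,
    re.MULTILINE | re.VERBOSE,
)

# Abbreviated stand-in for the contents of input.txt (an assumption).
sample = (
    "fff\n"
    "paragraph_a A_story(\n"
    "...\n"
    "some random texts adfsasdsd\n"
    "...\n"
    ")\n"
    "paragraph_b different_story(\n"
    "...\n"
    ")\n"
)

match = PARAGRAPH_BEFORE_B.search(sample)
print(match.group(0))  # the paragraph_a block, header through ")"
```

The negative lookahead inside the repeated group is what keeps the match from running across more than one paragraph: as soon as a line looks like a new header, the repetition stops.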
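Because the pattern spans several lines, it cannot match when re.sub is applied to one line at a time. Example, using input_file.read() to read the whole file: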
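import re

# Raw string so that \w, \( and \n reach the regex engine unchanged
pattern = r'^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)'

# Read the whole file at once, because the pattern spans several lines
with open('input.txt', 'r') as input_file:
    text = input_file.read()

# re.M lets ^ match at the start of every line
t = re.sub(pattern, '', text, flags=re.M)

with open('output.txt', 'w') as output_file:
    output_file.write(t)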
Contents of output.txt:
some random texts (100+ lines)
bbb
...
ttt
some random texts
ccc
...
fff
paragraph_b different_story(
...
some random texts
...
)
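One detail worth knowing: the pattern stops at the closing parenthesis, so the newline after it stays in the text and an empty line is left where the paragraph used to be. If that matters, a possible variant is to consume that newline as well. The sketch below does this and takes the name of the following paragraph as a parameter; the function name remove_paragraph_before and the sample text are assumptions made for this illustration, not part of the original code.

```python
import re


def remove_paragraph_before(text, next_name):
    """Delete the block that sits directly above the block named next_name."""
    name = re.escape(next_name)  # treat the name literally, not as a pattern
    pattern = (
        r"^(?!" + name + r")\w+ \w+\("   # header of the block to delete
        r"(?:\n(?!^\w+ \w+\().*)*"        # its lines, stopping at the next header
        r"\)\n"                           # closing parenthesis plus its newline
        r"(?=\s*^" + name + r")"          # only when next_name follows
    )
    return re.sub(pattern, "", text, flags=re.MULTILINE)


# Abbreviated stand-in for the contents of input.txt (an assumption).
sample = (
    "fff\n"
    "paragraph_a A_story(\n"
    "...\n"
    "some random texts adfsasdsd\n"
    "...\n"
    ")\n"
    "paragraph_b different_story(\n"
    "...\n"
    ")\n"
)

cleaned = remove_paragraph_before(sample, "paragraph_b")
print(cleaned)
```

With the sample above, the paragraph_a block disappears and paragraph_b follows fff directly, with no empty line in between. To apply it to a file, read the text with read(), pass it through the function, and write the result back out.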