I am a Python beginner.
I have a large txt file in the following format, made of many one-sentence paragraphs:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
****
Sed id placerat magna.
*******
Pellentesque in ex ac urna tincidunt tristique.
Etiam dapibus faucibus gravida.
I am trying to get as output only the paragraphs that follow an asterisk paragraph (a paragraph made of at least 4 asterisks).
The output I need:
Sed id placerat magna.
Pellentesque in ex ac urna tincidunt tristique.
I was trying something like this, but I have no idea (A) how to require a minimum of 4 asterisks per asterisk paragraph and (B) how to select the paragraph that comes after the asterisk paragraph.
import re
article_content = [open('text.txt').read() ]
after_asterisk_article_paragraph = []
string = "****"
after_asterisk_article_paragraph = string[string.find("****")+4:]
print(*after_asterisk_article_paragraph, sep='\n\n')
Again, I am just starting Python so please excuse me.
You might read the whole file and use a pattern that matches a line starting with at least 4 asterisks, followed by all lines that are not empty and do not start with 4 asterisks.
^\*{4,}((?:\r?\n(?!\s*$|\*{4}).+)*)
^\*{4,}
Match 4 or more times * at the start of a line (re.MULTILINE makes ^ match at every line start)
(
Capture group 1
(?:
Non-capture group
\r?\n
Match a newline
(?!\s*$|\*{4}).+
Match the whole line if it is not empty and does not start with 4 times *, using a negative lookahead (?!
)*
Optionally repeat the group
)
Close capture group 1
For example, using re.findall, which returns the capture group 1 value:
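import re

# Capture the non-empty lines that follow a line of 4 or more asterisks.
pattern = r'^\*{4,}((?:\r?\n(?!\s*$|\*{4}).+)*)'

# The with statement closes the file automatically.
with open('text.txt', mode='r') as file:
    result = [s.strip() for s in re.findall(pattern, file.read(), re.MULTILINE)]

print(result)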
Output
['Sed id placerat magna.', 'Pellentesque in ex ac urna tincidunt tristique.']
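Since the file is described as large, a line-by-line variant that avoids reading everything into memory may also be worth considering. The following is only a minimal sketch, not a drop-in replacement for the regex: the helper names is_separator and paragraphs_after_separators are made up for illustration, and it assumes that each paragraph sits on a single line (as in the question) and that an asterisk line contains nothing but asterisks. Under those assumptions it returns the first non-blank line after each asterisk line.

```python
from typing import Iterable, Iterator


def is_separator(line: str) -> bool:
    """Return True for a line made only of asterisks, at least 4 of them."""
    stripped = line.strip()
    return len(stripped) >= 4 and set(stripped) == {'*'}


def paragraphs_after_separators(lines: Iterable[str]) -> Iterator[str]:
    """Yield the first non-blank line that follows each separator line."""
    expecting = False  # True after a separator, until a paragraph is found
    for line in lines:
        if is_separator(line):
            expecting = True
        elif expecting and line.strip():
            yield line.strip()
            expecting = False


# In-memory sample (assumption: paragraphs may be separated by blank lines).
sample = """Lorem ipsum dolor sit amet, consectetur adipiscing elit.

****
Sed id placerat magna.

*******
Pellentesque in ex ac urna tincidunt tristique.

Etiam dapibus faucibus gravida.
"""

result = list(paragraphs_after_separators(sample.splitlines()))
print(result)
```

To run it on the real file, an open file object can be passed directly, because iterating over a file yields one line at a time: with open('text.txt') as file: result = list(paragraphs_after_separators(file)).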