Get paragraph after a certain symbol in Python


I am a Python beginner.

I have a large txt file in the following format, made up of many one-sentence paragraphs:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

****
Sed id placerat magna.

*******
Pellentesque in ex ac urna tincidunt tristique. 

Etiam dapibus faucibus gravida.

I am trying to output only the paragraphs that follow an asterisk paragraph (each asterisk paragraph contains at least 4 asterisks).

The output I need:

Sed id placerat magna.

Pellentesque in ex ac urna tincidunt tristique. 

I was trying something like this, but I have no idea (A) how to require a minimum of 4 asterisks per asterisk paragraph, or (B) how to select the paragraph after the asterisk paragraph.

import re

article_content = [open('text.txt').read() ]

after_asterisk_article_paragraph = []
 
string = "****"
after_asterisk_article_paragraph = string[string.find("****")+4:]

print(*after_asterisk_article_paragraph, sep='\n\n')

Again, I am just starting with Python, so please excuse me.


Solution

  • You could read the whole file and use a pattern that matches a line starting with at least 4 asterisks, followed by all subsequent lines that are neither empty nor start with 4 asterisks.

    ^\*{4,}((?:\r?\n(?!\s*$|\*{4}).+)*)
    
    • ^\*{4,} Match 4 or more * characters at the start of a line (^ matches at every line start because re.MULTILINE is used)
    • ( Capture group 1
      • (?: Non capture group
        • \r?\n Match a newline
        • (?!\s*$|\*{4}).+ Match the whole line, using a negative lookahead (?!...) to assert that the line is neither blank nor starts with 4 asterisks
      • )* Close the non-capture group and repeat it 0 or more times
    • ) Close capture group 1
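
The breakdown above can also be written into the pattern itself with re.VERBOSE, which ignores whitespace and allows comments inside the regex. The following is a minimal sketch rather than part of the original answer: SAMPLE is a hypothetical inline copy of the question's text, used so the snippet runs without a file.

```python
import re

# Hypothetical sample mirroring the layout of the question's file.
SAMPLE = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n"
    "\n"
    "****\n"
    "Sed id placerat magna.\n"
    "\n"
    "*******\n"
    "Pellentesque in ex ac urna tincidunt tristique. \n"
    "\n"
    "Etiam dapibus faucibus gravida.\n"
)

PATTERN = re.compile(
    r"""
    ^\*{4,}                # a line starting with 4 or more asterisks
    (                      # capture group 1: the lines that follow
      (?:
        \r?\n              # a newline
        (?!\s*$|\*{4})     # next line is neither blank nor an asterisk line
        .+                 # the rest of that line
      )*
    )
    """,
    re.MULTILINE | re.VERBOSE,
)

# findall returns the group 1 text of each match; strip removes the
# leading newline and any trailing spaces.
paragraphs = [m.strip() for m in PATTERN.findall(SAMPLE)]
print(paragraphs)
```

The pattern is the same one as above, only spread over several lines, so it should behave identically when applied to the contents of the file.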

    For example, using re.findall, which returns the capture group 1 value for each match:

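    import re

    # Read the whole file; the with block closes it automatically.
    with open('text.txt', mode='r') as file:
        text = file.read()
    # re.MULTILINE makes ^ and $ match at every line boundary.
    pattern = r'^\*{4,}((?:\r?\n(?!\s*$|\*{4}).+)*)'
    result = [s.strip() for s in re.findall(pattern, text, re.MULTILINE)]
    print(result)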
    

    Output

    ['Sed id placerat magna.', 'Pellentesque in ex ac urna tincidunt tristique.']
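
If the regex feels hard to follow, a plain loop over the lines may be easier for a beginner to adapt. This is a sketch of an alternative approach, not the method given above: the function name paragraphs_after_markers and the min_stars parameter are made up for illustration, and it assumes a marker line consists only of asterisks (the regex also accepts lines that merely start with four of them).

```python
def paragraphs_after_markers(lines, min_stars=4):
    """Collect the paragraph that directly follows each asterisk-only line."""
    results = []
    current = None  # lines of the paragraph being collected, or None
    for raw in lines:
        line = raw.strip()
        is_marker = len(line) >= min_stars and set(line) == {"*"}
        if is_marker or not line:
            # A marker or a blank line ends any paragraph in progress.
            if current:
                results.append("\n".join(current))
            # Start collecting only after a marker line.
            current = [] if is_marker else None
        elif current is not None:
            current.append(line)
    if current:  # the file may end without a trailing blank line
        results.append("\n".join(current))
    return results


# Hypothetical sample mirroring the layout of the question's file.
sample_lines = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    "",
    "****",
    "Sed id placerat magna.",
    "",
    "*******",
    "Pellentesque in ex ac urna tincidunt tristique. ",
    "",
    "Etiam dapibus faucibus gravida.",
]
print(paragraphs_after_markers(sample_lines))
```

For a real file, the open file object can be passed directly, since iterating over it yields one line at a time: with open('text.txt') as f: print(paragraphs_after_markers(f)). This also avoids loading the whole file into memory, which may matter for a large file.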