Tags: python, regex, text, text-mining

Python regex - Extract all the matching text between two patterns


I want to extract all the text that follows the bullet points numbered 1.1, 1.2, 1.3, etc. Sometimes the bullet numbers contain spaces around the dot, like 1. 1, 1. 2, 1 .3, 1 . 4.

Sample text

    text = "some text before pattern 1.1 text_1_here  1.2 text_2_here  1 . 3 text_3_here  1. 4 text_4_here  1 .5 text_5_here 1.10 last_text_here 1.23 text after pattern"

For the text above, the output should be [' text_1_here ', ' text_2_here ', ' text_3_here ', ' text_4_here ', ' text_5_here ', ' last_text_here ']

I tried re.findall but I'm not getting the required output. It picks up the text between 1.1 & 1.2 and then between 1.3 & 1.4, but it skips the text between 1.2 & 1.3.

    import re
    re.findall(r'[0-9].\s?[0-9]+(.*?)[0-9].\s?[0-9]+', text)

Solution

  • I'm not sure what exact rule makes you exclude the last bit of text, but based on your comments it seems we could also just split the entire text on the bullets and drop the first and last elements of the resulting list:

    re.split(r'\s+\d(?:\s*\.\s*\d+)+\s+', text)[1:-1]
    

    Which would output:

    ['text_1_here', 'text_2_here', 'text_3_here', 'text_4_here', 'text_5_here', 'last_text_here']
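
If you would rather stay with findall, here is a minimal sketch of a lookahead variant. The likely reason the original attempt skips every other item is that the closing bullet is consumed by each match, so it is no longer available as the opening bullet of the next one (the unescaped `.` in that pattern also matches any character, not just a dot). Putting the closing bullet in a lookahead leaves it unconsumed. The names `BULLET`, `pattern` and `items` are just illustrative, and the sketch assumes a bullet is a single digit, a dot with optional surrounding whitespace, and one or more digits:

```python
import re

text = "some text before pattern 1.1 text_1_here  1.2 text_2_here  1 . 3 text_3_here  1. 4 text_4_here  1 .5 text_5_here 1.10 last_text_here 1.23 text after pattern"

# Assumed bullet shape: one digit, optional spaces, a literal dot,
# optional spaces, then one or more digits (e.g. "1.1", "1 . 3", "1.10").
BULLET = r'\d\s*\.\s*\d+'

# Match a bullet, lazily capture what follows, and stop right before the
# next bullet. The lookahead (?=...) checks for that bullet without
# consuming it, so it can open the next match.
pattern = BULLET + r'(.*?)(?=' + BULLET + r')'

# strip() removes the padding spaces around each captured chunk.
items = [chunk.strip() for chunk in re.findall(pattern, text)]
print(items)
# ['text_1_here', 'text_2_here', 'text_3_here', 'text_4_here', 'text_5_here', 'last_text_here']
```

Since the text after the final bullet (1.23) is not followed by another bullet, it is never captured, so no slicing is needed here. Unlike the split pattern, this sketch does not require whitespace before a bullet, so something like "version 2.5" inside the text would also be treated as a bullet.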