python regex python-re

Regex return match plus string up until next match


Goal: Split text into a list at each integer or decimal number, so that each item starts with a number and contains all text up to, but not including, the next number. Language/version: Python 3.8.5 using re.findall(); I'm open to alternative suggestions.

Text example (yes, it's all on one line):

 1 Something Interesting here 2 More interesting text 2.1 An example of 2C19 a header 2.3 Another header example 2.4 another interesting header 10.1 header stuff  14 the last interesting 3A4 header

Goal Output:

['1 Something Interesting here',
'2 More interesting text',
'2.1 An example of 2C19 a header',
'2.3 Another header example',
'2.4 another interesting header',
'10.1 header stuff',
'14 the last interesting 3A4 header'
]

I can identify most of the appropriate integer/decimal starting points using:

(\d+\.\d+)|([^a-zA-Z]\d\d)|( \d )

I'm struggling to find a way to return the text between the matches plus the match itself.

To save you some time, here's my Regex sandbox

Thank you kindly


Solution

  • You can use a positive lookahead to match lazily up to, but not including, the next number (or the end of the string).

    Here is the updated regex (sandbox):

    \b(?:\d+(?:\.\d+)?)\b.*?(?=\b(?:\d+(?:\.\d+)?)\b|$)

    In Python:

    import re

    regex = r'\b(?:\d+(?:\.\d+)?)\b.*?(?=\b(?:\d+(?:\.\d+)?)\b|$)'
    string = ' 1 Something Interesting here 2 More interesting text 2.1 An example of 2C19 a header 2.3 Another header example 2.4 another interesting header 10.1 header stuff  14 the last interesting 3A4 header'
    result = re.findall(regex, string)
    

    In this case, result will be:

    >>> result
    ['1 Something Interesting here ',
     '2 More interesting text ',
     '2.1 An example of 2C19 a header ',
     '2.3 Another header example ',
     '2.4 another interesting header ',
     '10.1 header stuff  ',
     '14 the last interesting 3A4 header']
    

    Note that each match also includes the whitespace that precedes the next number. If you don't want it, call strip() on each string:

    >>> [ match.strip() for match in result ]
    ['1 Something Interesting here',
     '2 More interesting text',
     '2.1 An example of 2C19 a header',
     '2.3 Another header example',
     '2.4 another interesting header',
     '10.1 header stuff',
     '14 the last interesting 3A4 header']
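
Since the question is open to alternatives, here are two sketches that avoid the separate strip() step. The first splits with re.split() on the whitespace before each number; the second keeps re.findall() but lets the lookahead skip the whitespace as well. The helper names split_sections and find_sections and the NUMBER constant are made up for this illustration, and both assume that every standalone integer or decimal in the text starts a new item.

```python
import re

# A section number: an integer or decimal such as "2", "2.1", or "10.1".
NUMBER = r"\d+(?:\.\d+)?"


def split_sections(text):
    """Split before each standalone number, dropping the separating whitespace."""
    # \s+ consumes the gap between items. The lookahead requires a number
    # followed by whitespace or end of string, so "2C19" is not a split point.
    return re.split(rf"\s+(?={NUMBER}(?!\S))", text.strip())


def find_sections(text):
    """findall() variant: the lookahead also skips whitespace, so matches end clean."""
    pattern = rf"\b{NUMBER}\b.*?(?=\s*\b{NUMBER}\b|\s*$)"
    return re.findall(pattern, text)


text = (
    " 1 Something Interesting here 2 More interesting text"
    " 2.1 An example of 2C19 a header 2.3 Another header example"
    " 2.4 another interesting header 10.1 header stuff"
    "  14 the last interesting 3A4 header"
)

print(split_sections(text))
print(find_sections(text))
```

For the sample text, both calls print the goal list from the question. They differ when the text does not begin with a number: re.split() keeps the leading text as the first item, while re.findall() drops it. As with the regex above, a standalone number inside an item (for example "page 5 of") would also start a new item.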