Goal: Split text into a list at each integer or decimal match, so that each item starts at a match and runs up to, but not including, the next match. Language/version: Python 3.8.5 using re.findall(), though I'm open to alternative suggestions.
Text example (yes, it's all on one line):
1 Something Interesting here 2 More interesting text 2.1 An example of 2C19 a header 2.3 Another header example 2.4 another interesting header 10.1 header stuff 14 the last interesting 3A4 header
Goal Output:
['1 Something Interesting here',
'2 More interesting text',
'2.1 An example of 2C19 a header',
'2.3 Another header example',
'2.4 another interesting header',
'10.1 header stuff',
'14 the last interesting 3A4 header'
]
I can identify most of the appropriate integer/decimal starting points using:
(\d+\.\d+)|([^a-zA-Z]\d\d)|( \d )
I'm struggling to find a way to return the text between the matches plus the match itself.
To save you some time, here's my Regex sandbox
Thank you kindly
You can use positive lookahead expressions to match until the next match.
Here is the updated regex (sandbox):
\b(?:\d+(?:\.\d+)?)\b.*?(?=\b(?:\d+(?:\.\d+)?)\b|$)
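To make the pieces easier to follow, here is the same pattern laid out with re.VERBOSE and a comment per part. This is only an annotated sketch; the names HEADER_NUMBER and SECTION and the shortened sample text are mine, not part of the original question.

```python
import re

# Hypothetical helper name: a standalone integer or decimal such as 2 or 10.1.
# The word boundaries keep it from matching inside tokens like 2C19 or 3A4.
HEADER_NUMBER = r"\b\d+(?:\.\d+)?\b"

SECTION = re.compile(
    rf"""
    {HEADER_NUMBER}      # the number that starts a section
    .*?                  # as little following text as possible (lazy)
    (?=                  # lookahead: stop just before, without consuming,
        {HEADER_NUMBER}  #   the next standalone number
        | $              #   or the end of the string
    )
    """,
    re.VERBOSE,
)

text = "1 Something 2 More 2.1 An example of 2C19 a header"
sections = [m.strip() for m in SECTION.findall(text)]
print(sections)
```

Because the lookahead does not consume the next number, that number is still available as the start of the following match.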
In Python:
import re
regex = r'\b(?:\d+(?:\.\d+)?)\b.*?(?=\b(?:\d+(?:\.\d+)?)\b|$)'
string = ' 1 Something Interesting here 2 More interesting text 2.1 An example of 2C19 a header 2.3 Another header example 2.4 another interesting header 10.1 header stuff 14 the last interesting 3A4 header'
result = re.findall(regex, string)
In this case, result will be:
>>> result
['1 Something Interesting here ',
'2 More interesting text ',
'2.1 An example of 2C19 a header ',
'2.3 Another header example ',
'2.4 another interesting header ',
'10.1 header stuff ',
'14 the last interesting 3A4 header']
Note that this solution also captures the space that precedes the next number, so every item except the last ends with a trailing space. If you don't want that, you can call strip() on each string:
>>> [ match.strip() for match in result ]
['1 Something Interesting here',
'2 More interesting text',
'2.1 An example of 2C19 a header',
'2.3 Another header example',
'2.4 another interesting header',
'10.1 header stuff',
'14 the last interesting 3A4 header']
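Since the question is open to alternatives, another option might be re.split with a lookahead, which drops the separating whitespace and so needs no strip() afterwards. This is a sketch rather than a drop-in replacement; the function name split_sections is made up for illustration, and it assumes the section numbers are always preceded by whitespace.

```python
import re

def split_sections(text):
    """Split text at whitespace that is immediately followed by a standalone number."""
    # \s+ consumes the separating whitespace; the lookahead requires an integer
    # or decimal ending at a word boundary, so tokens like 2C19 and 3A4 do not split.
    return re.split(r"\s+(?=\d+(?:\.\d+)?\b)", text.strip())

string = ' 1 Something Interesting here 2 More interesting text 2.1 An example of 2C19 a header 2.3 Another header example 2.4 another interesting header 10.1 header stuff 14 the last interesting 3A4 header'
for section in split_sections(string):
    print(section)
```

Like the findall version, this splits at any standalone number, so body text containing something like "see 5 examples" would also start a new item.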