I have this code:
import re
x = "John Doe, Aug 5 2020 Hello Jane Doe: Aug 5 2020"
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
rx = re.compile(fr"\s+(?=(?:{'|'.join(months)})\b)", re.I)
print(rx.split(x))
Which outputs this:
['John Doe,', 'Aug 5 2020 Hello Jane Doe:', 'Aug 5 2020']
I would like it to output this:
["John Doe, Aug 5 2020", "Hello Jane Doe: Aug 5 2020"]
How could I do this? Thank you in advance for all the help!
Instead of split, you can use findall with this approach:
>>> rx = re.compile(fr"\b\S.*?(?:{'|'.join(months)})" + r"\s+\d{1,2}\s+\d{4}", re.I)
>>> print(rx.findall(x))
['John Doe, Aug 5 2020', 'Hello Jane Doe: Aug 5 2020']
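In this regex, the match starts at a word boundary followed by a non-whitespace character, then lazily matches any characters until it reaches the date string: one of the month names, followed by the day (one or two digits) and the four-digit year. Because .*? is lazy, each match stops at the first date it finds, so every entry ends with its own date.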
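For reference, here is the same idea as a small self-contained script. This is only a sketch: the names MONTHS, DATE, ENTRY_RX and split_entries are made up for illustration, and the \b anchors around the date are an extra assumption (not in the regex above) to keep a month abbreviation from matching in the middle of a longer word.

```python
import re

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Month abbreviation, 1-2 digit day, 4-digit year, e.g. "Aug 5 2020".
# The surrounding \b anchors are an added assumption, see the note above.
DATE = fr"\b(?:{'|'.join(MONTHS)})\s+\d{{1,2}}\s+\d{{4}}\b"

# \b\S   : start at a non-whitespace character on a word boundary
# .*?    : lazily consume text until the first date that follows
ENTRY_RX = re.compile(fr"\b\S.*?{DATE}", re.I)


def split_entries(text):
    """Return every chunk of text that ends with a date (hypothetical helper)."""
    return ENTRY_RX.findall(text)


x = "John Doe, Aug 5 2020 Hello Jane Doe: Aug 5 2020"
print(split_entries(x))
# ['John Doe, Aug 5 2020', 'Hello Jane Doe: Aug 5 2020']
```

Note that findall only returns text that ends in a date, so any trailing text without a date after it is dropped. If you need to keep that text, you would have to handle it separately.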