I have some issues with the re pattern to include the year of dates.
import re
text ="May 2020 Musical Portraits September 24 - 25, 2021 Time: 8:00 pm Toledo Museum of Art Peristyle Romeo & JulietSpecial EventWhenFriday, Mar 23 / 20187:30pmBuy TicketsSunday, Mar 25 / 20182:30pmBuy TicketsWhereSamford University Wright CenterMap & DirectionsArtist"
format_list = ["(?:(?:(?:j|J)an)|(?:(?:f|F)eb)|(?:(?:m|M)ar)|(?:(?:a|A)pr)|(?:(?:m|M)ay)|(?:(?:j|J)un)|(?:(?:j|J)ul)|(?:(?:a|A)ug)|(?:(?:s|S)ep)|(?:(?:o|O)ct)|(?:(?:n|N)ov)|(?:(?:d|D)ec))\w*(?:\s)?(?:\n)?[0-9]{1,2}(?:\s)?(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}"]
all_dates=[]
for pattern in format_list:
all_dates = re.findall(pattern, text)
if all_dates == []:
continue
else:
for index,txt in enumerate(all_dates):
text = re.sub('([^\x00-\x7F]+)|(\n)|(\t)',' ', txt)
all_dates[index] = text
print(all_dates)
['September 24 - 25, 2021', 'Mar 23 / 20187', 'Mar 25 / 20182']
['September 24 - 25, 2021', 'Mar 23 / 2018', 'Mar 25 / 2018']
Instead of "…2018"
, I'm getting "…20187"
and "…20182"
.
This pattern may do the job you need
(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w{0,6}\s+[\d\s\-\\/,]+?\d{4}
Code:
import re
text ="May 2020 Musical Portraits September 24 - 25, 2021 Time: 8:00 pm Toledo Museum of Art Peristyle Romeo & JulietSpecial EventWhenFriday, Mar 23 / 20187:30pmBuy TicketsSunday, Mar 25 / 20182:30pmBuy TicketsWhereSamford University Wright CenterMap & DirectionsArtist"
format_list = [
# r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w{0,6}[\d\s\-\\/,]*?\d{4}", # If you want to also match e.g. May 2020
r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w{0,6}\s+[\d\s\-\\/,]+?\d{4}",
]
for pattern in format_list:
all_dates = re.findall(pattern, text, re.IGNORECASE)
print(all_dates)
Output:
['September 24 - 25, 2021', 'Mar 23 / 2018', 'Mar 25 / 2018']
Where:
(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)
- Match the prefix of the months\w{0,6}
- Match the full name of a month optionally, the longest being "sep" (from previous match) + "tember"\s+
- Match 1 or more spaces.[\d\s\-\\/,]+?
- Match the days part whether separated by space, dash, or slash.\d{4}
- Match the year part.Note that since regex is just string-based processing, you would be limited with the format "mon day, year"
here. You would need additional patterns to match different possible date formats. You might want to explore date parsers that can scan through text.