python python-3.x regex python-re

re pattern to include year of dates


I'm having trouble getting my re pattern to capture the year of each date correctly.

Code

import re

text ="May 2020 Musical Portraits September 24 - 25, 2021 Time: 8:00 pm Toledo Museum of Art Peristyle  Romeo & JulietSpecial EventWhenFriday, Mar 23 / 20187:30pmBuy TicketsSunday, Mar 25 / 20182:30pmBuy TicketsWhereSamford University Wright CenterMap & DirectionsArtist"
format_list = [r"(?:(?:(?:j|J)an)|(?:(?:f|F)eb)|(?:(?:m|M)ar)|(?:(?:a|A)pr)|(?:(?:m|M)ay)|(?:(?:j|J)un)|(?:(?:j|J)ul)|(?:(?:a|A)ug)|(?:(?:s|S)ep)|(?:(?:o|O)ct)|(?:(?:n|N)ov)|(?:(?:d|D)ec))\w*(?:\s)?(?:\n)?[0-9]{1,2}(?:\s)?(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}"]

all_dates=[]

for pattern in format_list:
    all_dates = re.findall(pattern, text)
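    if not all_dates:
        continue
    # Replace non-ASCII runs, newlines and tabs in each match with a space.
    # A separate name is used so the input `text` is not overwritten.
    for index, match in enumerate(all_dates):
        all_dates[index] = re.sub(r'([^\x00-\x7F]+)|(\n)|(\t)', ' ', match)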
    print(all_dates)

Output

['September 24 - 25, 2021', 'Mar 23 / 20187', 'Mar 25 / 20182']

Desired Output

['September 24 - 25, 2021', 'Mar 23 / 2018', 'Mar 25 / 2018']

Issue

Instead of "…2018", I'm getting "…20187" and "…20182".


Solution

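  • The original pattern ends with two [0-9]{2,4} groups separated only by optional characters, so in "20187" it reads "201" as one number and "87" as the next. This pattern, which requires exactly four digits for the year, may do the job you need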

    (?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w{0,6}\s+[\d\s\-\\/,]+?\d{4}
    

    Code:

    import re
    
    text ="May 2020 Musical Portraits September 24 - 25, 2021 Time: 8:00 pm Toledo Museum of Art Peristyle  Romeo & JulietSpecial EventWhenFriday, Mar 23 / 20187:30pmBuy TicketsSunday, Mar 25 / 20182:30pmBuy TicketsWhereSamford University Wright CenterMap & DirectionsArtist"
    format_list  = [
        # r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w{0,6}[\d\s\-\\/,]*?\d{4}",  # If you want to also match e.g. May 2020
        r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w{0,6}\s+[\d\s\-\\/,]+?\d{4}",
    ]
    
    for pattern in format_list:
        all_dates = re.findall(pattern, text, re.IGNORECASE)
        print(all_dates)
    

    Output:

    ['September 24 - 25, 2021', 'Mar 23 / 2018', 'Mar 25 / 2018']
    

    Where:

    • (?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec) - Match the three-letter prefix of a month name.
    • \w{0,6} - Optionally match the rest of the month name; the longest remainder is "tember" (6 characters) after "sep".
    • \s+ - Match 1 or more whitespace characters.
    • [\d\s\-\\/,]+? - Lazily match the day part: digits separated by whitespace, dashes, slashes, backslashes, or commas.
    • \d{4} - Match the four-digit year.
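
The same breakdown can be kept next to the pattern itself with re.VERBOSE, which ignores whitespace and # comments outside character classes. A minimal sketch, assuming the pattern from this answer unchanged; the names DATE_RE and sample are made up for illustration:

```python
import re

# Same pattern as in the answer, spread over several lines with re.VERBOSE.
# DATE_RE and sample are illustrative names, not part of the original answer.
DATE_RE = re.compile(
    r"""
    (?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)  # month prefix
    \w{0,6}          # rest of the month name, e.g. "tember"
    \s+              # whitespace between month and day
    [\d\s\-\\/,]+?   # day part and separators, as few characters as possible
    \d{4}            # four-digit year
    """,
    re.IGNORECASE | re.VERBOSE,
)

sample = "Friday, Mar 23 / 20187:30pm and September 24 - 25, 2021"
print(DATE_RE.findall(sample))
# ['Mar 23 / 2018', 'September 24 - 25, 2021']
```

Because the character class contains no literal space or #, verbose mode should not change what the pattern matches.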

    Note that because a regex only matches text patterns, this one handles only the "month day, year" style shown here. Other date formats need additional patterns. You might also want to explore date parsers that can scan through free text.
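
One way to add further formats is to extend format_list and collect the matches of every pattern. A rough sketch under stated assumptions: the second pattern (numeric day/month/year) and the names PATTERNS, find_dates and sample are hypothetical additions, not part of the original answer, and the numeric pattern makes no attempt to tell day-first from month-first dates.

```python
import re

# Hypothetical pattern list; only the first entry comes from the answer above.
PATTERNS = [
    # Month name, day part, four-digit year (e.g. "Mar 25 / 2018").
    r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w{0,6}\s+[\d\s\-\\/,]+?\d{4}",
    # Illustrative numeric format such as "23/03/2018" or "23-03-2018".
    r"\b\d{1,2}[/.\-]\d{1,2}[/.\-]\d{4}",
]


def find_dates(text):
    """Return the matches of every pattern, grouped in pattern order."""
    found = []
    for pattern in PATTERNS:
        found.extend(re.findall(pattern, text, re.IGNORECASE))
    return found


sample = "Doors open 23/03/2018 7:30pm, encore on Mar 25 / 20182:30pm"
print(find_dates(sample))
# ['Mar 25 / 2018', '23/03/2018']
```

The results come back grouped by pattern rather than in the order they appear in the text; if position matters, re.finditer and its match.start() could be used to sort them.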