Search code examples
pythonregexstringdatefilenames

Regular expression match last occurence of year in string


I have written a python script with the following function, which takes as input a file name that contains multiple dates.

CODE

import re
from datetime import datetime

def ExtractReleaseYear(title):
    rg = re.compile('.*?([\[\(]?((?:19[0-9]|20[01])[0-9])[\]\)]?)', re.IGNORECASE|re.DOTALL)
    match = rg.search(title) # Using non-greedy match on filler
    if match:
        releaseYear = match.group(1)
        try:
            if int(releaseYear) >= 1900 and int(releaseYear) <= int(datetime.now().year) and int(releaseYear) <= 2099: # Film between 1900-2099
                return releaseYear
        except ValueError:
            print("ERROR: The film year in the file name could not be converted to an integer for comparison.")
            return ""

print(ExtractReleaseYear('2012.(2009).3D.1080p.BRRip.SBS.x264'))
print(ExtractReleaseYear('Into.The.Storm.2012.1080p.WEB-DL.AAC2.0.H264'))
print(ExtractReleaseYear('2001.A.Space.Odyssey.1968.1080p.WEB-DL.AAC2.0.H264'))

OUTPUT

Returned: 2012 -- I'd like this to be 2009 (i.e. last occurrence of year in string)

Returned: 2012 -- This is correct! (last occurrence of year is the first one, thus right)

Returned: 2001 -- I'd like this to be 1968 (i.e. last occurrence of year in string)

ISSUE

As can be observed, the regex will only target the first occurrence of a year instead of the last. This is problematic because some titles (such as the ones included here) begin with a year.

Having searched for ways to get the last occurrence of the year has led me to this resources like negative lookahead, last occurrence of repeated group and last 4 digits in URL, none of which have gotten me any closer to achieving the desired result. No existing question currently answers this unique case.

INTENDED OUTCOME

  • I would like to extract the LAST occurrence (instead of the first) of a year from the given file name and return it using the existing definition/function as stated in the output quote above. While I have used online regex references, I am new to regex and would appreciate someone showing me how to implement this filter to work on the file names above. Cheers guys.

Solution

  • There are two things you need to change:

    1. The first .*? lazy pattern must be turned to greedy .* (in this case, the subpatterns after .* will match the last occurrence in the string)
    2. The group you need to use is Group 2, not Group 1 (as it is the group that stores the year data). Or make the first capturing group non-capturing.

    See this demo:

    rg = re.compile('.*([\[\(]?((?:19[0-9]|20[01])[0-9])[\]\)]?)', re.IGNORECASE|re.DOTALL)
    ...
    releaseYear = match.group(2)
    

    Or:

    rg = re.compile('.*(?:[\[\(]?((?:19[0-9]|20[01])[0-9])[\]\)]?)', re.IGNORECASE|re.DOTALL)
    ...
    releaseYear = match.group(1)