Tags: regex, python-3.x, information-extraction

How to write a regular expression to extract years


How can I write a regular expression to extract years from text? The years may appear in the following forms:

Case 1:
1970 - 1980 --> 1970, 1980
January 1920 - Feb 1930 --> 1920, 1930
May 1920 to September 1930 --> 1920, 1930
Case 2:
July 1945 --> 1945

Writing a regular expression for Case 1 is easy, but how can I handle Case 2 along with it?

\d{4} \s? (?: [^a-zA-Z0-9] | to) \s? \w+? \d{4}

Solution

  • Regex: .*?([0-9]{4})(?:.*?([0-9]{4}))? or .*?(\d{4})(?:.*?(\d{4}))?

    Details:

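    • () Capturing group
    • (?:) Non-capturing group
    • {n} Matches the preceding token exactly n times
    • .*? Matches any character (except a newline) zero or more times, as few times as possible (lazy)
    • ? after the non-capturing group makes the second year optional, which is what covers Case 2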
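The pieces listed above can also be laid out one per line with re.VERBOSE, which ignores whitespace and allows comments inside the pattern. This is only a sketch of the same regex; the name YEAR_PATTERN is made up for illustration.

```python
import re

# Hypothetical name; same pattern as in the answer, split up with re.VERBOSE.
YEAR_PATTERN = re.compile(r"""
    .*?             # lazily skip anything before the first year
    ([0-9]{4})      # group 1: first four-digit run
    (?:             # non-capturing group around the optional second year
        .*?         # lazily skip the separator (dash, 'to', month names)
        ([0-9]{4})  # group 2: second four-digit run
    )?              # the whole second part is optional (Case 2)
""", re.VERBOSE)

print(YEAR_PATTERN.match("May 1920 to September 1930").groups())
print(YEAR_PATTERN.match("July 1945").groups())
```

With match(), a group that did not take part in the match is reported as None, so the second call shows only the first year filled in.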

    Python code:

    import re

    def extract_years(text):
        # Group 1 is the first year; group 2 is the optional second year.
        return re.findall(r'.*?([0-9]{4})(?:.*?([0-9]{4}))?', text)

    print(extract_years('January 1920 - Feb 1930'))
    

    Output:

    [('1920', '1930')]
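
For Case 2 ('July 1945') the pattern above returns [('1945', '')], because findall fills the unmatched second group with an empty string. If a flat list of years is all that is needed, a simpler alternative (not the pattern from the answer) is to match each four-digit run on its own. The sketch below assumes that any standalone four-digit number in the text is a year; the name find_year_tokens is illustrative.

```python
import re

def find_year_tokens(text):
    # \b on both sides keeps the match from landing inside a longer digit run.
    # Assumption: every standalone four-digit number in the text is a year.
    return re.findall(r"\b[0-9]{4}\b", text)

samples = [
    "1970 - 1980",
    "January 1920 - Feb 1930",
    "May 1920 to September 1930",
    "July 1945",
]
for sample in samples:
    print(sample, "-->", find_year_tokens(sample))
```

This handles both cases with one call, at the cost of no longer pairing the start and end year of a range.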