How to extract dates in different formats from a column of a Pandas DataFrame using regex

import re

def find_valid_dates(dt):
    result = re.findall(r"\d{1,2}-\d{2}-\d{2,4}|\d{1,2} (?:januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december) \d{1,4}", dt)
    # result = re.findall(r"\d{2}-\d{2}-\d{4}|[a-zA-Z]+\s+\d{4}", dt)
    return result

SaaOne_msi_vervangen['valid_dates'] = SaaOne_msi_vervangen['Oplossingstekst'].apply(find_valid_dates)

The column "Oplossingstekst" of my dataframe SaaOne_msi_vervangen contains multiple dates in different formats, for example 14-06-2020 and 2 oktober 2023. I tried to extract both kinds of dates using the or operator (|) in my findall pattern, but so far this code doesn't extract 2 oktober 2023. It may be related to the white space. How can I solve this?

Solution

The space problem you may have

I would personally replace the literal space " " with \s or \s+. This way, you match any kind of whitespace, including tabs, new lines and, for Python 3 str patterns, non-breaking spaces. You could be more restrictive and match only horizontal whitespace. PCRE has \h for that, but it is not available in Python's re module; the equivalent character class in Python is [ \t\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]. The list could be reduced. It is up to you to decide whether to match the whitespace exactly once or one or more times.
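To illustrate the point above, here is a minimal sketch comparing a literal space with \s+. It assumes the text contains a non-breaking space (U+00A0) between the day and the month, which is only one possible cause of the problem; the sample sentence and the variable names are made up for the example.

```python
import re

MONTHS = ("januari|februari|maart|april|mei|juni|juli|"
          "augustus|september|oktober|november|december")

# Pattern with literal spaces, as in the question
literal_space = re.compile(r"\d{1,2} (?:" + MONTHS + r") \d{2,4}")
# Same pattern, but accepting any run of whitespace
any_whitespace = re.compile(r"\d{1,2}\s+(?:" + MONTHS + r")\s+\d{2,4}")

# Hypothetical cell content: a non-breaking space sits between "2" and "oktober"
text = "Vervangen op 2\u00a0oktober 2023"

literal_hits = literal_space.findall(text)
whitespace_hits = any_whitespace.findall(text)

print(literal_hits)     # the literal-space pattern finds nothing
print(whitespace_hits)  # the \s+ pattern finds the date
```

If your data turns out to contain only ordinary spaces, both patterns behave the same, and the cause has to be looked for elsewhere (for example in the capitalisation of the month name).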

A regex with capturing groups

As you will probably have to parse the dates later, let's capture the day, month and year in named capturing groups. I would suggest this:

import re

regex = r"""
    \b # word boundary
    (?: # non-capturing group for the "or"
      # Short notation: 14-06-2022, 1-05-23
      (?P<short>
        (?P<short_day>\d{1,2})
        -
        (?P<short_month>\d{2})
        -
        (?P<short_year>\d{2}|\d{4})
      )
      | # Or
      # Text notation: 2 oktober 2023, 31 december 23
      (?P<text>
        (?P<text_day>\d{1,2}) # day
        \s+ # white spaces
        (?P<text_month>
          januari|februari|maart|april|mei|juni|juli|
          augustus|september|oktober|november|december
        )
        \s+ # white spaces
        (?P<text_year>\d{2}|\d{4}) # year with 2 or 4 digits, but not 3.
      )
    )
    \b # word boundary
    """
# test_str is the text to search, e.g. the content of one cell of the column
matches = re.finditer(regex, test_str, re.VERBOSE | re.IGNORECASE)

I used these flags:

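x = re.VERBOSE. The extended/verbose flag makes the engine ignore whitespace in the pattern and lets you put comments in your regex.
i = re.IGNORECASE. The match is case-insensitive, so "Oktober" is found as well as "oktober".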

For the year, I think that \d{2,4} isn't the best choice, as it would also match 3 digits, which is not really a valid year value. I replaced it with \d{2}|\d{4}.

I also added word boundaries (\b) around the pattern to avoid matching part of "1-06-123456", which could be a product id or something else.
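A small sketch of the effect of these two changes, reduced to the short notation only. The names loose and strict and the sample strings are just for illustration:

```python
import re

# Year as \d{2,4}, no word boundaries (as in the question)
loose = re.compile(r"\d{1,2}-\d{2}-\d{2,4}")
# Year as 4 or 2 digits, wrapped in word boundaries
strict = re.compile(r"\b\d{1,2}-\d{2}-(?:\d{4}|\d{2})\b")

samples = ["14-06-2020", "1-05-23", "1-06-123", "1-06-123456"]

loose_hits = [loose.findall(s) for s in samples]
strict_hits = [strict.findall(s) for s in samples]

for sample, a, b in zip(samples, loose_hits, strict_hits):
    print(f"{sample!r}: loose={a} strict={b}")
```

The loose pattern accepts the 3-digit year and cuts "1-06-1234" out of the longer number, while the strict one rejects both and still accepts the two valid dates.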

You can play with this regex101 and use the Code Generator to test the Python code.
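For the parsing step mentioned earlier, here is one possible sketch that turns the named groups into datetime.date objects. The names MONTHS, DATE_RE and extract_dates are hypothetical, the pattern is a condensed variant of the regex above, and mapping 2-digit years to 20xx is an assumption that you should adapt to your data.

```python
import re
from datetime import date

MONTHS = ["januari", "februari", "maart", "april", "mei", "juni", "juli",
          "augustus", "september", "oktober", "november", "december"]

DATE_RE = re.compile(
    r"""
    \b
    (?:
      # Short notation: 14-06-2022, 1-05-23
      (?P<short_day>\d{1,2}) - (?P<short_month>\d{2}) - (?P<short_year>\d{4}|\d{2})
      |
      # Text notation: 2 oktober 2023, 31 december 23
      (?P<text_day>\d{1,2}) \s+
      (?P<text_month>""" + "|".join(MONTHS) + r""") \s+
      (?P<text_year>\d{4}|\d{2})
    )
    \b
    """,
    re.VERBOSE | re.IGNORECASE,
)

def extract_dates(text):
    """Return all dates found in text as a list of datetime.date objects."""
    found = []
    for m in DATE_RE.finditer(text):
        if m.group("short_day") is not None:
            day = int(m.group("short_day"))
            month = int(m.group("short_month"))
            year = int(m.group("short_year"))
        else:
            day = int(m.group("text_day"))
            # Month names are matched case-insensitively, so normalise first
            month = MONTHS.index(m.group("text_month").lower()) + 1
            year = int(m.group("text_year"))
        if year < 100:
            year += 2000  # assumption: 2-digit years mean 20xx
        try:
            found.append(date(year, month, day))
        except ValueError:
            continue  # looks like a date but is not one, e.g. 31-02-2020
    return found

print(extract_dates("Vervangen op 14-06-2020 en opnieuw op 2 Oktober 2023"))
```

With the DataFrame from the question, this function could be passed to apply in the same way as find_valid_dates, e.g. SaaOne_msi_vervangen['Oplossingstekst'].apply(extract_dates), assuming every cell of the column holds a string.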