Tags: python, regex, dataframe, date-format

How to extract date from a specified column containing different types of date formats of a given Pandas DataFrame using Regex


import re

def find_valid_dates(dt):
    # Raw strings keep the backslashes in \d and \s intact for the regex engine.
    result = re.findall(r"\d{1,2}-\d{2}-\d{2,4}|\d{1,2} (?:januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december) \d{1,4}", dt)
    # result = re.findall(r"\d{2}-\d{2}-\d{4}|[a-zA-Z]+\s+\d{4}", dt)
    return result

SaaOne_msi_vervangen['valid_dates']=SaaOne_msi_vervangen['Oplossingstekst'].apply(lambda dt : find_valid_dates(dt))

The column "Oplossingstekst" of my dataframe SaaOne_msi_vervangen contains multiple dates in different formats, for example 14-06-2020 and 2 oktober 2023. I tried to extract both formats using the alternation operator (|) in my findall call, but so far this code doesn't extract 2 oktober 2023. It may be related to the whitespace. How can I solve this?


Solution

  • The space problem you may have

    I would personally replace the literal space " " with \s or \s+. That way you match any kind of whitespace, including tabs and newlines, and in Python 3 string patterns also Unicode spaces such as the non-breaking space. You could be more restrictive and match only horizontal whitespace. PCRE has \h for this; Python's re module does not support \h, but an equivalent character class is [ \t\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]. The list could be reduced. It is up to you whether to match it once or more than once.
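
    A minimal sketch of the whitespace issue. The sample text is an assumption (the question does not show the real data): it puts non-breaking spaces inside the second date, which is one plausible reason a literal space fails to match.

```python
import re

# Assumed sample: the second date uses non-breaking spaces (U+00A0),
# which look like ordinary spaces but are different characters.
text = "Vervangen op 14-06-2020 en op 2\u00a0oktober\u00a02023"

literal_space = re.compile(r"\d{1,2} oktober \d{4}")
any_whitespace = re.compile(r"\d{1,2}\s+oktober\s+\d{4}")
# Narrower alternative: space, tab and non-breaking space only (no newlines).
horizontal_only = re.compile(r"\d{1,2}[ \t\u00a0]+oktober[ \t\u00a0]+\d{4}")

print(literal_space.findall(text))    # []
print(any_whitespace.findall(text))   # ['2\xa0oktober\xa02023']
print(horizontal_only.findall(text))  # ['2\xa0oktober\xa02023']
```

    If the real column contains other invisible characters, printing repr() of a problematic cell is a quick way to see which ones.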

  • A regex with capturing groups

    Since you will probably need to parse the dates later, let's capture the day, month and year in named capturing groups. I would suggest this:

    import re

    regex = r"""
        \b # word boundary
        (?: # non-capturing group for the "or"
          # Short notation: 14-06-2022, 1-05-23
          (?P<short>
            (?P<short_day>\d{1,2})
            -
            (?P<short_month>\d{2})
            -
            (?P<short_year>\d{2}|\d{4})
          )
          | # Or
          # Text notation: 2 oktober 2023, 31 december 23
          (?P<text>
            (?P<text_day>\d{1,2}) # day
            \s+ # white spaces
            (?P<text_month>
              januari|februari|maart|april|mei|juni|juli|
              augustus|september|oktober|november|december
            )
            \s+ # white spaces
            (?P<text_year>\d{2}|\d{4}) # year with 2 or 4 digits, but not 3.
          )
        )
        \b # word boundary
        """
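    # test_str is the text to search; this sample value is only an illustration
    test_str = "Vervangen op 14-06-2020 en op 2 oktober 2023"
    matches = re.finditer(regex, test_str, re.VERBOSE | re.IGNORECASE)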
    

    I used these flags:

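    • x = re.VERBOSE. The extended/verbose flag lets you add whitespace and comments to your regex. Literal spaces in the pattern are then ignored, which is why the whitespace is matched with \s+.

    • i = re.IGNORECASE. This makes the month names match regardless of case, e.g. "Oktober" or "OKTOBER".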
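
    To see what the two flags do, here is a small sketch with a shortened, hypothetical pattern (only two month names) and an assumed sample sentence:

```python
import re

# Hypothetical, shortened pattern used only to illustrate the two flags.
pattern = r"""
    \d{1,2}               # day
    \s+                   # whitespace must be written explicitly in verbose mode
    (?:oktober|november)  # month name, lower case in the pattern
    \s+
    \d{4}                 # year
"""

text = "Geplaatst op 2 Oktober 2023"  # assumed sample with a capitalised month

# VERBOSE alone: the comments and layout are ignored, but "Oktober" != "oktober".
print(re.findall(pattern, text, re.VERBOSE))                  # []
# VERBOSE and IGNORECASE combined with the bitwise OR operator.
print(re.findall(pattern, text, re.VERBOSE | re.IGNORECASE))  # ['2 Oktober 2023']
```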

    For the year, I think that \d{2,4} isn't the best choice, as it would also match 3 digits, which is not really a valid year value. I replaced it with \d{2}|\d{4}.
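
    A quick comparison of the two year quantifiers, as a sketch on made-up sample strings. Note that outside a named group the alternation has to be wrapped in a group, here (?:...), so that the | only applies to the year.

```python
import re

loose = re.compile(r"\d{1,2}-\d{2}-\d{2,4}")
strict = re.compile(r"\d{1,2}-\d{2}-(?:\d{2}|\d{4})")

# Made-up samples: a 4-digit year, a 2-digit year and an invalid 3-digit year.
samples = ["14-06-2020", "1-05-23", "14-06-202"]
for s in samples:
    print(s, bool(loose.fullmatch(s)), bool(strict.fullmatch(s)))
# 14-06-2020 True True
# 1-05-23 True True
# 14-06-202 True False
```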

    I also added word boundaries \b around the pattern to avoid matching part of something like "1-06-123456", which could be a product id or anything else.
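
    The effect of the word boundaries can be checked with a short sketch. The sample sentence is invented and only the short date notation is used:

```python
import re

text = "Product 1-06-123456 vervangen op 14-06-2020"  # invented sample

without_boundaries = re.compile(r"\d{1,2}-\d{2}-(?:\d{4}|\d{2})")
with_boundaries = re.compile(r"\b\d{1,2}-\d{2}-(?:\d{4}|\d{2})\b")

# Without \b, the start of the product id is mistaken for a date.
print(without_boundaries.findall(text))  # ['1-06-1234', '14-06-2020']
# With \b, the year must end where the run of digits ends.
print(with_boundaries.findall(text))     # ['14-06-2020']
```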

    You can experiment with this regex on regex101 and use its Code Generator to get Python code for testing.
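
    Since the named groups are meant for parsing the dates later, here is one possible sketch of that step. The helper parse_dates, the MONTHS table and the rule that two-digit years belong to the 2000s are assumptions for illustration, not part of the original answer.

```python
import re
from datetime import date

# Condensed version of the verbose pattern, with the named groups kept.
DATE_RE = re.compile(
    r"""
    \b
    (?:
        (?P<short_day>\d{1,2}) - (?P<short_month>\d{2}) - (?P<short_year>\d{4}|\d{2})
      |
        (?P<text_day>\d{1,2}) \s+
        (?P<text_month>januari|februari|maart|april|mei|juni|juli|
                       augustus|september|oktober|november|december)
        \s+ (?P<text_year>\d{4}|\d{2})
    )
    \b
    """,
    re.VERBOSE | re.IGNORECASE,
)

# Dutch month names mapped to month numbers.
MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4,
    "mei": 5, "juni": 6, "juli": 7, "augustus": 8,
    "september": 9, "oktober": 10, "november": 11, "december": 12,
}


def parse_dates(text):
    """Return the dates found in text as datetime.date objects (hypothetical helper)."""
    found = []
    for m in DATE_RE.finditer(text):
        if m.group("short_day") is not None:
            day, month, year = m.group("short_day", "short_month", "short_year")
            month = int(month)
        else:
            day, year = m.group("text_day", "text_year")
            month = MONTHS[m.group("text_month").lower()]
        year = int(year)
        if year < 100:
            year += 2000  # assumption: two-digit years are in the 2000s
        try:
            found.append(date(year, month, int(day)))
        except ValueError:
            continue  # skip impossible dates such as 31-02-2020
    return found


print(parse_dates("Vervangen op 14-06-2020, gecontroleerd op 2 Oktober 2023 en 1-05-23"))
# [datetime.date(2020, 6, 14), datetime.date(2023, 10, 2), datetime.date(2023, 5, 1)]
```

    With pandas, a function like this could be passed directly to the column, e.g. SaaOne_msi_vervangen['Oplossingstekst'].apply(parse_dates), provided every cell is a string; missing values would need to be handled first.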