import re

def find_valid_dates(dt):
    result = re.findall(r"\d{1,2}-\d{2}-\d{2,4}|\d{1,2} (?:januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december) \d{1,4}", dt)
    # result = re.findall(r"\d{2}-\d{2}-\d{4}|[a-zA-Z]+\s+\d{4}", dt)
    return result

SaaOne_msi_vervangen['valid_dates'] = SaaOne_msi_vervangen['Oplossingstekst'].apply(find_valid_dates)
The column "Oplossingstekst" of my dataframe SaaOne_msi_vervangen contains multiple dates in different formats, for example 14-06-2020 and 2 oktober 2023. I tried to extract both kinds of date using the or operator (|) in my findall, but so far this code doesn't extract 2 oktober 2023. It may be related to the whitespace. How can I solve this?
I would personally replace the literal space " " by \s or \s+. This way, you can match all kinds of whitespace (including non-breaking spaces and newlines). You could also be more restrictive and match only horizontal whitespace, which is \h in PCRE. Python's re module does not support \h, but it is equivalent to the following character class (PCRE notation; in Python, write each \x{NNNN} as \uNNNN):
[\t\x{00A0}\x{1680}\x{180E}\x{2000}\x{2001}\x{2002}\x{2003}\x{2004}\x{2005}\x{2006}\x{2007}\x{2008}\x{2009}\x{200A}\x{202F}\x{205F}\x{3000} ]
The list could be reduced. It is up to you to decide whether to match it once or more than once.
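As a rough illustration of the three options above, here is a minimal sketch comparing a literal space, \s+ and a hand-written horizontal-whitespace class. The names H_SPACE, literal, any_space and horizontal are made up for this example, and the sample strings are assumptions about what the column might contain (a non-breaking space is a common culprit in copied text, but I can't confirm that is the case in your data).

```python
import re

# Rough Python counterpart of PCRE's \h (horizontal whitespace only).
# Hypothetical helper name; the character list mirrors the class above.
H_SPACE = r"[ \t\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]"

literal = re.compile(r"\d{1,2} oktober \d{4}")         # literal space only
any_space = re.compile(r"\d{1,2}\s+oktober\s+\d{4}")   # any whitespace, incl. newlines
horizontal = re.compile(rf"\d{{1,2}}{H_SPACE}+oktober{H_SPACE}+\d{{4}}")

# Assumed sample inputs, not taken from the real dataframe.
samples = {
    "plain spaces": "2 oktober 2023",
    "non-breaking spaces": "2\u00a0oktober\u00a02023",
    "line break": "2\noktober 2023",
}

for label, text in samples.items():
    print(
        label,
        bool(literal.search(text)),
        bool(any_space.search(text)),
        bool(horizontal.search(text)),
    )
```

With these inputs, the literal-space pattern only finds the first sample, \s+ finds all three, and the horizontal class finds the first two but not the one containing a line break.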
Since you will probably have to parse the dates later, let's capture the day, month and year in named capturing groups. I would suggest this:
import re

regex = r"""
    \b                              # word boundary
    (?:                             # non-capturing group for the "or"
        # Short notation: 14-06-2022, 1-05-23
        (?P<short>
            (?P<short_day>\d{1,2})
            -
            (?P<short_month>\d{2})
            -
            (?P<short_year>\d{2}|\d{4})
        )
    |                               # Or
        # Text notation: 2 oktober 2023, 31 december 23
        (?P<text>
            (?P<text_day>\d{1,2})   # day
            \s+                     # white spaces
            (?P<text_month>
                januari|februari|maart|april|mei|juni|juli|
                augustus|september|oktober|november|december
            )
            \s+                     # white spaces
            (?P<text_year>\d{2}|\d{4})  # year with 2 or 4 digits, but not 3.
        )
    )
    \b                              # word boundary
"""

test_str = "14-06-2020 and 2 oktober 2023"  # sample input, dates taken from the question
matches = re.finditer(regex, test_str, re.VERBOSE | re.IGNORECASE)
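To show how the named groups could be used afterwards, here is a minimal sketch that turns each match into a datetime.date. It uses a condensed version of the regex above; DATE_RE, MONTHS, to_date and extract_dates are hypothetical names, and mapping two-digit years to 2000+ is purely my assumption, so adjust it to your data.

```python
import re
from datetime import date

# Condensed version of the verbose regex above (hypothetical name).
DATE_RE = re.compile(r"""
    \b
    (?:
        (?P<short_day>\d{1,2})-(?P<short_month>\d{2})-(?P<short_year>\d{4}|\d{2})
      |
        (?P<text_day>\d{1,2})\s+
        (?P<text_month>januari|februari|maart|april|mei|juni|juli|
                       augustus|september|oktober|november|december)
        \s+(?P<text_year>\d{4}|\d{2})
    )
    \b
""", re.VERBOSE | re.IGNORECASE)

MONTHS = ["januari", "februari", "maart", "april", "mei", "juni", "juli",
          "augustus", "september", "oktober", "november", "december"]


def to_date(match):
    """Build a date from whichever alternative (short or text) matched."""
    if match.group("short_day") is not None:
        day = int(match.group("short_day"))
        month = int(match.group("short_month"))
        year = int(match.group("short_year"))
    else:
        day = int(match.group("text_day"))
        month = MONTHS.index(match.group("text_month").lower()) + 1
        year = int(match.group("text_year"))
    if year < 100:
        year += 2000  # assumption: two-digit years belong to the 2000s
    return date(year, month, day)  # raises ValueError for impossible dates


def extract_dates(text):
    """Return all dates found in text, in order of appearance."""
    return [to_date(m) for m in DATE_RE.finditer(text)]


print(extract_dates("Vervangen op 14-06-2020 en opnieuw op 2 Oktober 2023."))
```

If this fits your case, something like SaaOne_msi_vervangen['Oplossingstekst'].apply(extract_dates) should give a list of dates per row. Note that a string such as 31-02-2020 matches the regex but makes date() raise ValueError, so you may want to catch that.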
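I used these flags:
x = re.VERBOSE. The extended/verbose flag lets you spread the regex over several lines and put comments in it.
i = re.IGNORECASE. The match is case-insensitive, so "Oktober" is found as well as "oktober".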
For the year, I think that \d{2,4} isn't the best choice, as it would also match 3 digits, which is not really a valid year value. I replaced it by \d{2}|\d{4}.
I also added the word boundaries \b around the pattern to avoid matching part of "1-06-123456", which could be a product id or something else.
You can play with this regex on regex101 and use its Code Generator to test the Python code.
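If you prefer to check the effect of the stricter year pattern and the word boundaries locally, here is a small sketch comparing the short-notation part of the original pattern with the tightened one. The names loose and strict and the sample strings are only for illustration.

```python
import re

# Short-notation part of the pattern from the question.
loose = re.compile(r"\d{1,2}-\d{2}-\d{2,4}")
# Same idea with a 2- or 4-digit year and word boundaries.
strict = re.compile(r"\b\d{1,2}-\d{2}-(?:\d{4}|\d{2})\b")

for text in ["14-06-2020", "1-05-23", "14-06-202", "1-06-123456"]:
    print(text, loose.findall(text), strict.findall(text))
```

The loose pattern accepts the 3-digit year and cuts "1-06-1234" out of the product-id-like string, while the strict one rejects both and still finds the two real dates.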