Tags: python, regex, date, findall

Python findall returns unexpected results


I'm new to Python but have to write a regex to pick up dates in dd-mm-yyyy format from text. I wrote something like this:

format1 = re.findall('[0-2][0-9]-02-(\d){4}|(([0-2][0-9]|30)-(04|06|09|11)-(\d){4})|(([0-2][0-9]|30|31)-(01|03|05|07|08|10|12)-(\d){4})',article)

It also checks whether the date is valid. I tested it at pythex.org. It returns the right dates, but unfortunately also some empty matches and random numbers:

Match 1
1.  None
2.  None
3.  None
4.  None
5.  None
6.  21-10-2005
7.  21
8.  10
9.  5

Match 2
1.  None
2.  None
3.  None
4.  None
5.  None
6.  31-12-1993
7.  31
8.  12
9.  3

How can I improve the regex to return only dates or drop everything that isn't a date?


Solution

  • It looks to me like you need to make use of non-capturing groups.

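    Here's the thing: in a regular expression, anything inside plain parentheses () is a capturing group - its text comes out as a separate item in the match. When a pattern contains capturing groups, re.findall returns those groups instead of the whole match, which is why you get empty entries and stray numbers. A group like (\d){4} also keeps only its last repetition, hence the lone 5 and 3.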

    If you want to use parentheses to group a part of the pattern (e.g. so that you can use | at something lower than the top level), but you don't want the text inside that parenthetical group to be a separate item in the match output, then you want to use a non-capturing group instead.

    To do that, where you would have had (foo), instead use (?:foo) - adding the ?: to the beginning. That prevents that group from capturing text in the final match.
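
As a rough sketch of that advice applied to the pattern from the question, the version below swaps every (...) for (?:...) and writes (\d){4} as \d{4}. The sample text in article is made up for illustration, and the \b word boundaries are an extra assumption added here so that dates embedded in longer digit runs are skipped. Like the original pattern, it still accepts day 00 and 29-02 in any year.

```python
import re

# Hypothetical sample text, not taken from the question.
article = "Signed on 21-10-2005, revised 31-12-1993, rejected 31-04-2001."

# With capturing groups, findall returns one tuple of group contents per match.
capturing = re.findall(r'(\d{2})-(\d{2})-(\d{4})', article)

# With only non-capturing groups, findall returns the whole matched strings.
date_pattern = re.compile(
    r'\b(?:'                                                  # \b is an addition
    r'[0-2][0-9]-02-\d{4}'                                    # February
    r'|(?:[0-2][0-9]|30)-(?:04|06|09|11)-\d{4}'               # 30-day months
    r'|(?:[0-2][0-9]|3[01])-(?:01|03|05|07|08|10|12)-\d{4}'   # 31-day months
    r')\b'
)
dates = date_pattern.findall(article)

print(capturing[0])  # ('21', '10', '2005')
print(dates)         # ['21-10-2005', '31-12-1993']
```

Here 31-04-2001 is dropped because April falls in the 30-day branch. If you do need the day, month and year separately, keeping the capturing groups and using re.finditer with match.group(0) for the full date is another option.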