I'm new to Python but have to write a regex to pick up dates in the format dd-mm-yyyy from text. I wrote something like this:
format1 = re.findall('[0-2][0-9]-02-(\d){4}|(([0-2][0-9]|30)-(04|06|09|11)-(\d){4})|(([0-2][0-9]|30|31)-(01|03|05|07|08|10|12)-(\d){4})',article)
It also checks whether the date itself is valid. I tested it at pythex.org. It returns the right dates, but unfortunately also some empty matches and random numbers:
Match 1
1. None
2. None
3. None
4. None
5. None
6. 21-10-2005
7. 21
8. 10
9. 5
Match 2
1. None
2. None
3. None
4. None
5. None
6. 31-12-1993
7. 31
8. 12
9. 3
How can I improve the regex to return only dates or drop everything that isn't a date?
It looks to me like you need to make use of non-capturing groups.
Here's the thing: in a regular expression, anything inside plain parentheses () is a capturing group, so the text it matches comes out as a separate item in the match.
If you want to use parentheses to group a part of the pattern (e.g. so that you can use | at something lower than the top level), but you don't want the text inside that parenthetical group to be a separate item in the match output, then you want to use a non-capturing group instead.
To do that, where you would have had (foo), use (?:foo) instead, adding ?: right after the opening parenthesis. That prevents the group from capturing text in the final match.
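As a rough sketch of how that could look for the pattern in the question: the sample text in article below is made up for illustration, and the rewritten pattern simply swaps every (...) for (?:...) and (\d){4} for \d{4}. With no capturing groups left, re.findall returns the whole matches as plain strings.

```python
import re

# Hypothetical sample text; the last two dates are deliberately invalid.
article = "Published 21-10-2005, revised 31-12-1993, invalid 31-04-2000 and 30-02-2001."

# Original pattern from the question: nine capturing groups.
capturing = re.compile(
    r'[0-2][0-9]-02-(\d){4}'
    r'|(([0-2][0-9]|30)-(04|06|09|11)-(\d){4})'
    r'|(([0-2][0-9]|30|31)-(01|03|05|07|08|10|12)-(\d){4})'
)

# Same pattern with non-capturing groups and \d{4} instead of (\d){4}.
non_capturing = re.compile(
    r'[0-2][0-9]-02-\d{4}'
    r'|(?:[0-2][0-9]|30)-(?:04|06|09|11)-\d{4}'
    r'|(?:[0-2][0-9]|30|31)-(?:01|03|05|07|08|10|12)-\d{4}'
)

# With capturing groups, findall returns one 9-item tuple per match.
capturing_result = capturing.findall(article)

# Without capturing groups, findall returns the matched strings themselves.
dates = non_capturing.findall(article)

# Alternative that keeps the original pattern: take group 0 (the whole match).
whole_matches = [m.group(0) for m in capturing.finditer(article)]

print(capturing_result[0])
print(dates)  # ['21-10-2005', '31-12-1993']
print(whole_matches)
```

Note that this only changes what gets captured, not what gets matched: the pattern still accepts a day of 00 and 29-02 in any year, so stricter validation would need something beyond this regex.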