I'm confused about the pattern matching in the following expression. I tried looking it up online but couldn't find an understandable explanation:
imgurUrlPattern = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
What exactly are the parentheses doing? I understood everything up to the first asterisk, but I can't figure out what is happening after that.
(http://i.imgur.com/(.*))(\?.*)?
The first capturing group, (http://i.imgur.com/(.*)), matches the literal text http://i.imgur.com/ followed by any number of characters, (.*). This is a poor regex and you shouldn't do it this way: .* is greedy and matches everything up to the end of the line. The inner (.*) is also the second capturing group.
The third capturing group, (\?.*), must start with a literal ? and then contain any number of any characters, as above.
The final ? makes that last capturing group optional.
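EDIT: These groups can then be used as:
import re
p = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
m = p.match('http://i.imgur.com/abc.jpg')  # sample URL; match() returns None when the string does not match
m.group(0)  # the whole match
m.group(2)  # the second group: whatever follows http://i.imgur.com/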
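To see what each group actually holds, here is a minimal sketch using the pattern from the question. The URL is a made-up sample, not something from the original post. Note how the greedy (.*) swallows the query string, so the optional third group never captures anything.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')

# Hypothetical sample URL with a query string (an assumption for illustration).
url = 'http://i.imgur.com/abc.jpg?1'

m = pattern.match(url)
if m is not None:
    print(m.group(0))  # whole match: http://i.imgur.com/abc.jpg?1
    print(m.group(1))  # group 1:     http://i.imgur.com/abc.jpg?1
    print(m.group(2))  # group 2:     abc.jpg?1  (query string included)
    print(m.group(3))  # group 3:     None       (the optional group was skipped)

# A string that does not start with the prefix gives no match object at all.
no_match = pattern.match('ab')
print(no_match)  # None
```

Because match() returns None on failure, it is worth checking the result before calling group() on it.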
To improve the regex, restrict each group to the characters you actually expect, like:
(http://i.imgur.com/([A-Za-z0-9\-]+))(\?[^/]+)?
[A-Za-z0-9\-]+ limits the second group to alphanumeric characters and hyphens.
[^/]+ matches one or more characters other than /.
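As a rough check of the restricted version, the sketch below uses a cleaned-up form of the pattern (character class written as A-Za-z0-9 plus hyphen, and a plain [^/]+ for the query part). The sample URLs are assumptions for illustration only.

```python
import re

# Restricted pattern: the second group only accepts letters, digits and hyphens,
# and the optional third group accepts a ? followed by non-slash characters.
improved = re.compile(r'(http://i.imgur.com/([A-Za-z0-9\-]+))(\?[^/]+)?')

# Hypothetical sample URL without a file extension.
m = improved.match('http://i.imgur.com/abc123?1')
print(m.group(1))  # http://i.imgur.com/abc123
print(m.group(2))  # abc123
print(m.group(3))  # ?1

# With a file extension, the character class stops at the dot,
# so the match ends early and the query string is not captured.
m2 = improved.match('http://i.imgur.com/abc.jpg?1')
print(m2.group(0))  # http://i.imgur.com/abc
print(m2.group(3))  # None
```

If the file extension should be part of the second group, one option is to add . to the character class, for example [A-Za-z0-9.\-]+.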