python regex url imgur

Python regex: Matching a URL


I am confused about the pattern matching in the following expression. I tried looking it up online but couldn't find an understandable explanation:

imgurUrlPattern = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')

What exactly are the parentheses doing? I understand it up to the first asterisk, but I can't figure out what happens after that.


Solution

  • (http://i.imgur.com/(.*))(\?.*)?
    

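    The first capturing group (http://i.imgur.com/(.*)) matches the literal text http://i.imgur.com/ followed by any number of characters (.*). The nested (.*) is also the second capturing group. (This is a poor regex and you shouldn't write it this way: .* is greedy, so it swallows everything up to the end of the line, including any query string, and the unescaped dots match any character rather than only a literal dot.)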

    The third capturing group (\?.*) matches a literal ? (escaped as \?) followed by any number of any characters, as above.

    The last ? means that the last capturing group is optional.

    EDIT: These groups can then be used as:

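    import re

    p = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
    m = p.match('http://i.imgur.com/abc123.jpg')  # example URL; match() returns None when there is no match
    m.group(0)  # the entire match
    m.group(2)  # the second group: everything after http://i.imgur.com/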
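
To see what each group actually captures, here is a minimal sketch that runs the pattern from the question against a sample URL. The URL and the variable names are made up for illustration. Note how the greedy (.*) takes the query string with it, so the third group is left empty:

```python
import re

# Pattern from the question, unchanged.
pattern = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')

# Hypothetical sample URL with a query string.
url = 'http://i.imgur.com/abc123.jpg?1'
m = pattern.match(url)

whole = m.group(0)   # entire match
first = m.group(1)   # outer group: host plus everything after it
second = m.group(2)  # inner (.*): everything after the slash
third = m.group(3)   # optional (\?.*) group

print(whole)   # http://i.imgur.com/abc123.jpg?1
print(first)   # http://i.imgur.com/abc123.jpg?1
print(second)  # abc123.jpg?1
print(third)   # None
```

Because (.*) has already consumed "?1", the optional third group has nothing left to match and comes back as None.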
    

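    To improve the regex, limit the engine to the characters you actually need, for example:

    (http://i\.imgur\.com/([A-Za-z0-9\-]+))(\?[^/]+)?
    

    [A-Za-z0-9\-]+ limits the match to alphanumeric characters and hyphens (a range written as A-z would also match [, \, ], ^, _ and `)
    [^/]+ matches one or more characters other than /
    \. matches a literal dot instead of any character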
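
As a rough check of the tightened approach, the sketch below assumes the intent is an alphanumeric image ID followed by an optional query string. The exact pattern, the sample URL, and the variable names are illustrative assumptions, not a definitive imgur URL matcher:

```python
import re

# Assumed cleaned-up pattern: escaped dots, explicit letter ranges,
# and a query-string group that stops at the next slash.
improved = re.compile(r'(http://i\.imgur\.com/([A-Za-z0-9\-]+))(\?[^/]+)?')

# Hypothetical sample URL with a query string.
m = improved.match('http://i.imgur.com/abc123?1')
image_id = m.group(2)  # 'abc123'
query = m.group(3)     # '?1'

# Why A-z is risky: the range also covers the punctuation between 'Z' and 'a'.
loose_matches_underscore = re.fullmatch(r'[A-z]', '_') is not None      # True
strict_matches_underscore = re.fullmatch(r'[A-Za-z]', '_') is not None  # False

print(image_id, query, loose_matches_underscore, strict_matches_underscore)
```

With the restricted character class the query string now lands in the third group. If the URLs carry a file extension such as .jpg, the class would need to allow a dot as well, otherwise the ID group stops at the dot.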