python regex python-re

Regex to find valid URLs with or without www, including dots but excluding double dots


I am trying to find a regex that matches URLs with or without 'www', followed by valid strings that can include dots, but not two or more consecutive dots. For the sake of simplicity, I am limiting the problem to URLs with subdomains and with the .com domain. For example:

www.aBC.com      #MATCH
abc.com          #MATCH
a_bc.de8f.com    #MATCH
a.com            #MATCH
abc              #NO MATCH
abc..com         #NO MATCH

The closest I have come is the regex \w+.[\w]+.com, but it does not match a simple "a.com". I am using "\w" instead of "." because otherwise I don't know how to avoid two or more dots in sequence.
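For reference, the behaviour described above can be reproduced with a small sketch. It assumes Python's standard `re` module and whole-string matching with `fullmatch`; the names `attempt`, `short_result` and `loose_result` are only illustrative. It also shows a side effect of the unescaped dots: each one accepts any character, not just a literal ".".

```python
import re

# The attempt from the question, copied verbatim (dots left unescaped).
attempt = re.compile(r"\w+.[\w]+.com")

# "a.com" is too short: the pattern needs at least seven characters
# (one for each \w+, one for each ".", and three for "com").
short_result = attempt.fullmatch("a.com")

# An unescaped "." accepts any character (except a newline),
# so a string with no dots at all can still match.
loose_result = attempt.fullmatch("aXbYcom")

print(short_result)               # None
print(loose_result is not None)   # True
```

Escaping the dots as `\.` is what makes them match only a literal period.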

Any help is appreciated.


Solution

  • Use

    (?:\w+\.)*\w+\.com
    

    See regex proof.

    EXPLANATION

    -------------------------------------------------------------------------------
      (?:                      group, but do not capture (0 or more times
                               (matching the most amount possible)):
    --------------------------------------------------------------------------------
        \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                                 more times (matching the most amount
                                 possible))
    --------------------------------------------------------------------------------
        \.                       '.'
    --------------------------------------------------------------------------------
      )*                       end of grouping
    --------------------------------------------------------------------------------
      \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                               more times (matching the most amount
                               possible))
    --------------------------------------------------------------------------------
      \.                       '.'
    --------------------------------------------------------------------------------
      com                      'com'
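As a quick check, the pattern can be run against the examples from the question. This is a minimal sketch, assuming Python's `re` module and that each candidate must match as a whole string (hence `fullmatch` rather than `search`); `URL_RE`, `is_match`, `samples` and `results` are illustrative names, not part of the original answer.

```python
import re

# Pattern from the answer: zero or more "label." groups, then "label.com".
URL_RE = re.compile(r"(?:\w+\.)*\w+\.com")


def is_match(text):
    """Return True if the whole string matches the pattern."""
    return URL_RE.fullmatch(text) is not None


# Test strings taken from the question.
samples = ["www.aBC.com", "abc.com", "a_bc.de8f.com", "a.com", "abc", "abc..com"]
results = {s: is_match(s) for s in samples}

for s, ok in results.items():
    print(f"{s:<15} {'MATCH' if ok else 'NO MATCH'}")
```

Because every dot must be preceded by at least one `\w` character, two consecutive dots can never match. Note that in Python 3, `\w` also matches non-ASCII letters and digits for `str` patterns; passing `re.ASCII` to `re.compile` restricts it to a-z, A-Z, 0-9 and _, as in the explanation above.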