
Is this kind of regex possible without negative lookahead?


Basically, the regex I'm looking to create should match every google domain except google.com and google.com.au.

So google.org, google.uk or google.com.pk would be a match. I'm working within the limitations of re2, and the best I've been able to come up with is:

google\.([^c][^o][^m]\.?[^a]?[^u]?)

This doesn't work for extended domains like google.com.pk, and it doesn't work if the root is two letters long, e.g. .cn instead of .org.

It works if there's no extended domain and the root isn't two letters long: google.org matches and google.com doesn't.

Here's the link with test cases. regexr.com/7rbkn

I'm looking for a workaround for negative lookahead, or to find out whether it's possible to accommodate this within a single regex string.


Solution

  • Sure you can. The pattern will look a bit ugly, but what you are asking for is totally possible.

    Let's assume that the input already satisfies the regex google(?:\.[a-z]+)+ (i.e. google followed by one or more dot-separated names) for ease of explanation. If you want more precision, see this answer.

    Match a name that is not a given name

    The inverse of com would be:

    • A name that is shorter or longer than 3 characters, or
    • A name of length 3 where:
      • The first character is not c, or
      • The second character is not o, or
      • The third character is not m.

    Translate that to regex and we have:

    \A                    # This means "at the very start"
    (?:
      [a-z]{1,2} |
      [a-z]{4,} |
    
      [^c.][a-z]{2} |     # Also exclude the dot,
      [a-z][^o.][a-z] |   # otherwise 'c.m' would be
      [a-z]{2}[^m.]       # accepted as a single name
    )
    \z                    # This means "at the very end"
    

    The same applies to au:

    \A(?:[a-z]|[a-z]{3,}|[^a.][a-z]|[a-z][^u.])\z
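The two "any name but X" patterns can be checked in isolation. Below is a minimal sketch in Python, used here only to exercise the patterns (they contain no lookaround, so nothing depends on Python's engine). The names NOT_COM, NOT_AU, is_not_com and is_not_au are made up for illustration, and re.fullmatch stands in for the \A ... \z anchors.

```python
import re

# Any lowercase name except "com": wrong length, or length 3 with at least
# one letter that differs. The dot is excluded so a class cannot span names.
NOT_COM = r"(?:[a-z]{1,2}|[a-z]{4,}|[^c.][a-z]{2}|[a-z][^o.][a-z]|[a-z]{2}[^m.])"

# Any lowercase name except "au", built the same way.
NOT_AU = r"(?:[a-z]|[a-z]{3,}|[^a.][a-z]|[a-z][^u.])"


def is_not_com(name: str) -> bool:
    """True when the whole string is a single name other than 'com'."""
    return re.fullmatch(NOT_COM, name) is not None


def is_not_au(name: str) -> bool:
    """True when the whole string is a single name other than 'au'."""
    return re.fullmatch(NOT_AU, name) is not None


if __name__ == "__main__":
    for name in ["com", "org", "cn", "info", "c.m"]:
        print(name, is_not_com(name))
    for name in ["au", "pk", "a", "aus"]:
        print(name, is_not_au(name))
```

Each alternative corresponds to one bullet in the list above: two length checks, then one branch per character position.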
    

    Match a hostname that is not a given hostname

    There are two cases you want to avoid: google.com and google.com.au. The inverse of that would be the union of the following cases:

    • 1 extra name:
      • google.* where * is any name but com
    • 2 extra names:
      • google.*.* where the first * is any name but com, or
      • google.com.* where * is any name but au
    • 3 extra names or more: google.*.*.* ...

    Or, put more logically:

    • If the first name is not com, it doesn't matter how many names are left.
      • Any hostname following this pattern already differs from our exclude cases by one name.
    • If the first name is com and the second name is not au, the rest of the names are also irrelevant.
      • ...for the exact same reason as above.
    • If the first and second names are com and au respectively, then there must be at least one other name, which means there are at least three extra names.
      • ...and if there are three extra names, then we don't need to check the first and the second at all.
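The case analysis above can also be written without any regex, which gives a reference to compare a pattern against. This is a sketch under the same assumption as the rest of the answer (names are lowercase a-z only); should_match is a hypothetical helper name, not something from the question.

```python
def should_match(host: str) -> bool:
    """Reference check: google.<names...> except google.com and google.com.au."""
    names = host.split(".")
    if len(names) < 2 or names[0] != "google":
        return False

    extra = names[1:]
    # Assumption from the answer: every name is one or more lowercase letters.
    if not all(n.isascii() and n.isalpha() and n.islower() for n in extra):
        return False

    if extra[0] != "com":
        return True              # first name differs, the rest is irrelevant
    if len(extra) == 1:
        return False             # exactly google.com
    if extra[1] != "au":
        return True              # google.com.<anything but au>...
    return len(extra) >= 3       # google.com.au needs at least one more name


if __name__ == "__main__":
    for host in ["google.com", "google.com.au", "google.org",
                 "google.com.pk", "google.com.au.uk"]:
        print(host, should_match(host))
```

The three return points after the validity check mirror the three branches the regex needs.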

    That said, we only need three branches. Let <not-com> stand for the inverse of com and <not-au> for the inverse of au; here's what the pattern looks like in pseudo-regex:

    \A
    (?:
      google\.<not-com>      (?:\.[a-z]+)*   |
      google\.com\.<not-au>  (?:\.[a-z]+)*   |
      google                 (?:\.[a-z]+){3,}
    )
    \z


    See the common parts? We can extract them out:

    \A
    google
    (?:
      \.<not-com>        |
      \.com\.<not-au>    |
      (?:\.[a-z]+){3}
    )
    (?:\.[a-z]+)*
    \z
    

    Insert what we had from section 1, and voilà.

    The final pattern

    \A
    google
    (?:
      # google.* where * is any name but com
      \.
      (?:
        [a-z]{1,2} | [a-z]{4,} |
        [^c.][a-z]{2} |
        [a-z][^o.][a-z] |
        [a-z]{2}[^m.]
      )
    |
      # google.com.* where * is any name but au
      \.com\.
      (?:
        [a-z] | [a-z]{3,} |
        [^a.][a-z] | [a-z][^u.]
      )
    |
      # google.*.*.*
      (?:\.[a-z]+){3}
    )
    (?:\.[a-z]+)*
    \z
    

    Try it on regex101.com: PCRE2 with comments, Go, multiline mode.
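For engines that have no free-spacing mode, the commented pattern has to be collapsed into a single line first. The sketch below does that by stripping comments and whitespace, then runs the result against a few hostnames. Python's re module is used only as a convenient test bench; FINAL_VERBOSE, FINAL_COMPACT and matches are illustrative names, and re.fullmatch replaces the \A ... \z anchors.

```python
import re

# The final pattern from above, minus the anchors (fullmatch supplies them).
FINAL_VERBOSE = r"""
google
(?:
  # google.* where * is any name but com
  \.(?:[a-z]{1,2}|[a-z]{4,}|[^c.][a-z]{2}|[a-z][^o.][a-z]|[a-z]{2}[^m.])
|
  # google.com.* where * is any name but au
  \.com\.(?:[a-z]|[a-z]{3,}|[^a.][a-z]|[a-z][^u.])
|
  # google.*.*.*
  (?:\.[a-z]+){3}
)
(?:\.[a-z]+)*
"""

# Drop comments and all whitespace. This is safe here only because the
# pattern itself contains no literal '#' or space characters.
FINAL_COMPACT = re.sub(r"#[^\n]*|\s+", "", FINAL_VERBOSE)


def matches(host: str) -> bool:
    """True when the whole hostname is matched by the compact pattern."""
    return re.fullmatch(FINAL_COMPACT, host) is not None


if __name__ == "__main__":
    print(FINAL_COMPACT)
    for host in ["google.com", "google.com.au", "google.org", "google.cn",
                 "google.com.pk", "google.co.uk", "google.com.au.uk"]:
        print(host, matches(host))
```

When pasting FINAL_COMPACT into another engine, wrap it in that engine's own start and end anchors.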