Tags: python, regex, gitleaks

regex for ip-address, domain and url


Problem statement:

I am trying to write regexes for IP addresses, domains and URLs. These are my definitions:

IP Address:

93.114.205.169 

Domain:

example.com 
sub.example.com

Url:

93.114.205.169/path 
example.com/path
sub.example.com/path

So a URL always has a path to a resource, whereas an IP address or domain must not have one; otherwise it would be a URL. Also note that an IP address, domain or URL can optionally start with http or https, with or without www.


My attempt:

I have tried various ways for these regex:

[[rules]]
id = "ip-address"
description = "Potential IP Address detected."
regex = '''\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b'''
entropy = 2
keywords = ["ip"]


[[rules]]
id = "domain"
description = "Potential domain name detected."
regex = '''\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'''
entropy = 2
keywords = ["domain"]

[[rules]]
id = "url"
description = "Potential URL detected."
regex = '''\b(?:https?|ftp):\/\/(?:[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}|\d{1,3}(?:\.\d{1,3}){3})(?::\d{1,5})?(?:\/[^\s\"<>]*)?\b'''
entropy = 2
keywords = ["http", "https", "ftp", "url"]

But these regexes classify IP addresses as URLs. For example, http://93.114.205.169 is matched by the url rule rather than by the ip-address rule, although by my definitions above it should be an IP address only.

I changed to these regex:

[[rules]]
id = "ip-address"
description = "Potential IP Address detected."
regex = '''\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b'''
entropy = 2
keywords = ["ip"]

[[rules]]
id = "domain"
description = "Potential domain name detected."
regex = '''\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'''
entropy = 2
keywords = ["domain"]

[[rules]]
id = "url"
description = "Potential URL detected."
regex = '''\b(?:https?|ftp):\/\/(?:[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}|\d{1,3}(?:\.\d{1,3}){3})(?::\d{1,5})?(?:\/[^\s\"<>]*)?\b'''
entropy = 2
keywords = ["http", "https", "ftp", "url"]

This has the same problem: http://93.114.205.169 is recognized as a URL, but it is also recognized as an IP address. The ip-address rule also picks up IP addresses from inside URLs, so http://93.114.205.169/path is reported as a URL and 93.114.205.169 as an IP address.
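The overlap can be reproduced outside the configuration with Python's re module. This is only a minimal sketch: it assumes each rule's regex is applied to the text independently, and the names ip_rule and url_rule are illustrative, not part of any tool.

```python
import re

# The two patterns from the rules above, copied unchanged.
ip_rule = re.compile(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b')
url_rule = re.compile(
    r'\b(?:https?|ftp):\/\/(?:[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}|\d{1,3}(?:\.\d{1,3}){3})'
    r'(?::\d{1,5})?(?:\/[^\s\"<>]*)?\b'
)

text = 'http://93.114.205.169/path'

# Each rule is run on its own, so both of them report a hit for the same text.
ip_hits = [m.group(0) for m in ip_rule.finditer(text)]
url_hits = [m.group(0) for m in url_rule.finditer(text)]

print(ip_hits)   # the IP address embedded in the URL
print(url_hits)  # the whole URL
```

Both lists are non-empty for the same input, which is exactly the double reporting described above.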


Could you suggest correct regexes for these definitions?

IP Address:

93.114.205.169 

Domain:

example.com 
sub.example.com

Url:

93.114.205.169/path 
example.com/path
sub.example.com/path

These IP addresses, domains and URLs can optionally start with http or https, with or without www.


Solution

  • Some Details Are Not Clear

    The "code" that you posted is not Python but rather appears to be some sort of configuration file. Without understanding how the input is being processed with this configuration, it is difficult to give you a precise answer. An example will illustrate this:

    Based on your English-language description, a URL is essentially either an IP address or a domain specification followed by a path, which starts with a '/' character (I will assume that such a forward slash can, but need not, be followed by alpha characters, so that '200.12.119.1/' is a URL). Let's say that we have a regex for detecting IP addresses and we are able to match, for example, '250.127.100.2' or 'http://250.127.100.2' with this regex. It would then be erroneous to match an IP address within the string '250.127.100.2/somepath'.

    We could create a single regex that was the "or-ing" of separate regular expressions for detecting a URL, a domain and an IP address such as:

    ip_regex = r'some regex'
    domain_regex = r'some regex'
    url_regex = fr'(?:{ip_regex}|{domain_regex})/[a-zA-Z]*'
    rex = f'{url_regex}|{ip_regex}|{domain_regex}'
    

    So rex is the final or-ing of three subexpressions, with the URL pattern as the first alternative. If we pass this regular expression to re.finditer and iterate over the result, we find all matches, and an IP address, for example, is only matched when it is not part of a larger URL match, because the URL alternative is tried first. But what you posted leaves it open to question whether this is even possible: your actual Python code would need to take the individual regexes from your configuration file and join them together with a '|' between them.

    The second and most likely alternative is that the input is being tested by individual regular expressions. So if we are just looking for say IP addresses, our regex for such a match now needs to use a negative lookahead to ensure that the candidate match is not followed by a '/' character.
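    A minimal sketch of that second alternative, assuming a deliberately simplified IP pattern (no octet range check) so that the lookahead is easy to see. The names simple_ip, ip_only and samples are illustrative only:

```python
import re

# Simplified IP pattern: four dot-separated groups of 1-3 digits,
# not preceded or followed by another digit or period.
simple_ip = r'(?<![0-9.])(?:[0-9]{1,3}\.){3}[0-9]{1,3}(?![0-9.])'

# Same pattern, but the candidate is rejected when a '/' follows immediately.
ip_only = re.compile(simple_ip + r'(?!/)')

samples = ['250.127.100.2', '250.127.100.2/somepath', '250.127.100.2/']
results = {s: bool(ip_only.search(s)) for s in samples}
print(results)
```

    Only the first sample is reported as an IP address; the other two are followed by a '/' and would be left for a URL rule to pick up.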

    Suggestions

    First, if you really want to validate proper IP addresses, you would want to use something like:

    (?x)
    (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}
    (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])
    

    There are more concise expressions for validating an IP address, but the above regex is the most readable. It would accept '123.255.12.1' but not '333.255.12.1'. We might even want to reject '123.255.12.1' depending on what precedes it and follows it. For example, the string '123.255.12.1.99' contains a couple of valid IP addresses, i.e. '123.255.12.1' and '255.12.1.99', but I suspect we might not wish to accept either. In this case, we might add some negative assertions:

    (?x)
    (?<![0-9.])
    (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}
    (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])
    (?![0-9.])
    

    Now we are ensuring that our candidate match is neither preceded nor followed by a digit or a period.
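    The difference between the two expressions can be checked with a short script. This is a sketch only; octet, plain_ip and guarded_ip are names I made up for the illustration:

```python
import re

# One octet in the range 0-255, as in the expressions above.
octet = r'(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])'

# Without and with the negative lookbehind/lookahead assertions.
plain_ip = re.compile(fr'(?:{octet}\.){{3}}{octet}')
guarded_ip = re.compile(fr'(?<![0-9.])(?:{octet}\.){{3}}{octet}(?![0-9.])')

print(bool(plain_ip.fullmatch('123.255.12.1')))   # valid octets
print(bool(plain_ip.fullmatch('333.255.12.1')))   # 333 is out of range
print(plain_ip.findall('123.255.12.1.99'))        # finds an address inside the longer string
print(guarded_ip.findall('123.255.12.1.99'))      # assertions reject every candidate
print(guarded_ip.findall('ip=123.255.12.1;'))     # other surrounding characters are fine
```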

    The following program demonstrates the above points. The first two calls to re.finditer, which match IP addresses and domains, use regexes with an additional negative lookahead assertion. These are the regexes you would use if the Python code that reads the configuration file needs to look for just one specific type of entity. The final call to re.finditer uses the or-ing of the three regexes; there the IP and domain subexpressions do not need the extra negative lookahead assertion, because an IP address or domain is only matched if the longer URL cannot be matched.

    Needless to say, if you need to initialize a configuration file, then wherever I use f-strings to join previously defined regexes you would need to do the substitution manually. I would suggest that you print out the regexes and remove the extraneous whitespace and comments that I use with the (?x) flag.
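    One possible way to do that clean-up is sketched below. The helper flatten_verbose is hypothetical, and it assumes that '#' and whitespace appear only as comments and layout, never escaped or inside a character class, which holds for the patterns in this answer but not for verbose regexes in general. The resulting one-line string is also only usable if the tool that reads the configuration has a regex engine with lookaround support; engines in the RE2 family, such as Go's regexp package, do not support lookarounds.

```python
import re

def flatten_verbose(pattern: str) -> str:
    """Collapse a (?x)-style pattern into a single line.

    Assumption: '#' and whitespace occur only as comments and layout.
    """
    parts = []
    for line in pattern.splitlines():
        code = line.split('#', 1)[0]          # drop the trailing comment
        parts.append(''.join(code.split()))   # drop all layout whitespace
    return ''.join(parts).replace('(?x)', '')

verbose_ip = r'''(?x)
    (?<![0-9.])  # not preceded by a digit or period
    (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}
    (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])
    (?![0-9.])   # not followed by a digit or period
    (?!/)        # not followed by a /
'''

flat_ip = flatten_verbose(verbose_ip)
print(flat_ip)  # one line, no whitespace, no comments
```

    The printed string behaves like the verbose original when compiled without the (?x) flag, so it can be compared against the verbose version on a few sample inputs before being copied anywhere.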

    import re
    
    prefix = r'(?:https?://)'
    
    basic_ip_regex = fr'''
        (?:{prefix}|(?<![0-9.]))  # preceded by http:// or not preceded by a digit or period
        (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){{3}}
        (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])
        (?![0-9.])  # Not followed by a digit or period
    '''
    
    ip_regex = fr'''(?x)
        {basic_ip_regex}
        (?!/)  # Not followed additionally by a /
    '''
    
    basic_domain_regex = fr'''
        (?:{prefix}|(?<!\.))  # optionally preceded by http:// or https://, or not preceded by a period
        # The possessive quantifier ++ requires Python 3.11 or later
        (?:www\.)?(?:[a-zA-Z]+\.)+[a-zA-Z]++  # Match as many alpha characters as possible
        (?!\.)  # not followed by a period
    '''
    
    domain_regex = fr'''(?x)
        {basic_domain_regex}
        (?!/)  # Not followed additionally by a /
    '''
    
    basic_url_regex = fr'''
        (?:
            {basic_ip_regex}
            |
            {basic_domain_regex}
        )
        /[a-zA-Z]*  # / by itself is a path
    '''
    
    url_regex = f'(?x){basic_url_regex}'
    
    # If we can use finditer:
    rex = fr'''(?x)
        (?P<url>{basic_url_regex})
        |
        (?P<ip>{basic_ip_regex})
        |
        (?P<domain>{basic_domain_regex})
    '''
    
    text = """
      123.45.67.89 # IP address
      http://123.45.67.89 # IP address
      https://123.45.67.89 # IP address
      123.45.67.89.99  # Invalid
      323.45.67.89  # Invalid
      booboo.com  # domain
      www.booboo.com  # domain
      http://www.booboo.com  # domain
      https://www.booboo.com  # domain
      123.45.67.89/abc  # URL
      http://23.45.67.89/abc  # URL
      https://23.45.67.89/abc  # URL
      https://booboo.com/abc  # URL
      123.45.67.89/  # URL
    """
    
    # Just look for IP addresses:
    for m in re.finditer(ip_regex, text):
        print('IP', m[0])
    
    print('\n************\n')
    
    # Just look for domains
    for m in re.finditer(domain_regex, text):
        print('domain', m[0])
    
    print('\n************\n')
    
    # Just look for URLs
    for m in re.finditer(url_regex, text):
        print('URL', m[0])
    
    print('\n************\n')
    
    # Look for everything:
    for m in re.finditer(rex, text):
        print(m.lastgroup, m[0])
    

    Prints:

    IP 123.45.67.89
    IP http://123.45.67.89
    IP https://123.45.67.89
    
    ************
    
    domain booboo.com
    domain www.booboo.com
    domain http://www.booboo.com
    domain https://www.booboo.com
    
    ************
    
    URL 123.45.67.89/abc
    URL http://23.45.67.89/abc
    URL https://23.45.67.89/abc
    URL https://booboo.com/abc
    URL 123.45.67.89/
    
    ************
    
    ip 123.45.67.89
    ip http://123.45.67.89
    ip https://123.45.67.89
    domain booboo.com
    domain www.booboo.com
    domain http://www.booboo.com
    domain https://www.booboo.com
    url 123.45.67.89/abc
    url http://23.45.67.89/abc
    url https://23.45.67.89/abc
    url https://booboo.com/abc
    url 123.45.67.89/