python regex regex-lookarounds regexp-replace

Regex whitelisted url lets blocked urls through on the same line in message


I have a regex that blocks URLs in a message, but I want to whitelist my site's URL.

Currently it works with any prefix, such as HTTP://www.example.com, and with paths, such as www.example.com/support/how-do-i-setup-this. But if I put another URL after the whitelisted one on the same line, it gets through the filter, which I don't want. (The second URL is blocked as required only if I put it on a new line.)

For example, in "go to http://example.com/support/how-do-i www.badurl.com" the bad URL is not blocked, but I want it to be.

Also, the string "www.badurl.comexample.com" results in both being blocked, but ideally I would like to whitelist the example.com URL here too.

[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,24}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?(?<!\bexample.com(/.*)?)

Current Python function code:

import re

def link_remover(message):
   #remove any links that aren't in whitelist
   message = re.sub(r"[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,24}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?(?<!\bexample.com)", "[URL Removed]", message)
   return message

So I'm wondering how to edit it to fix the two failing examples above.

I appreciate any responses or pointing me in the right direction :)


Solution

  • Final Update:
    Added a negative lookbehind boundary at the start of every whitelist check.
    Example: (?<=(?<![-a-z\u00a1-\uffff0-9])example\.com)
    This class ensures that only an optional subdomain can come before the whitelisted domain,
    since the optional parts of the regex that can precede it end in only a dot or a forward slash.
    Therefore no stray letters can be adjacent to it, as in wrongexample.com.
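
As a rough illustration of the boundary check in isolation, the sketch below applies the same lookbehind to a few host names. The helper name `ends_with_whitelisted` and the trailing `\Z` anchor are assumptions added for the demonstration; they are not part of the pattern in this answer.

```python
import re

# Zero-width check taken from the answer, with \Z appended (an assumption for
# this demo) so that it is only tested at the very end of the string.
BOUNDARY_CHECK = re.compile(r"(?<=(?<![-a-z\u00a1-\uffff0-9])example\.com)\Z")

def ends_with_whitelisted(host):
    """Return True if host ends in example.com with a clean left boundary."""
    return BOUNDARY_CHECK.search(host) is not None

for host in ["example.com", "www.example.com", "wrongexample.com", "bad-example.com"]:
    print(host, ends_with_whitelisted(host))
```

The first two hosts pass because nothing, or only a dot, precedes example.com. The last two fail because a letter or a hyphen sits directly before it.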

    This is an example where the whitelist items are optionally matched.
    Every URL is matched. The whitelist check is placed right after the domain
    is matched, so the match still encompasses any trailing optional port or directories.

    A replacement callback is all that's needed to check whether any of the whitelist URLs matched.
    If so, just write the match back unchanged.
    If none matched, write back the Removed string.

    Modified logic:
    Changed to need only one capture group.
    The group is used as a flag.

    If the group is None, no whitelist item was found for the match.
    The callback returns "{Removed}", which overwrites the bad URL.

    Otherwise a whitelist item was found, and the match is returned unchanged:
    return m.group(0).
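
The flag-group idea can be tried on a much smaller pattern. The sketch below is not the pattern from this answer: the host sub-pattern is a deliberately simplified stand-in, and the names `PATTERN` and `filter_links` are hypothetical. Only the optional group 1 and the callback follow the logic described above.

```python
import re

# Simplified stand-in for the full URL pattern (an assumption for this demo):
# a dotted host name, an optional flag group, then an optional path.
PATTERN = re.compile(
    r"(?:[a-z0-9-]+\.)+[a-z]{2,}"              # host, e.g. www.example.com
    r"((?<=(?<![-a-z0-9])example\.com))?"      # group 1: empty match if whitelisted
    r"(?:/\S*)?"                               # optional path
)

def repl(m):
    # Group 1 did not participate in the match: the host is not whitelisted.
    if m.group(1) is None:
        return "{Removed}"
    # Group 1 matched (as an empty string): keep the whole URL unchanged.
    return m.group(0)

def filter_links(text):
    return PATTERN.sub(repl, text)

print(filter_links("go to http://example.com/support/how-do-i www.badurl.com"))
```

Because group 1 is optional and zero-width, it never changes what is matched. It only records whether a whitelist check succeeded at the end of the host.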

    Notes:
    All URLs are matched. Single capture group. Unlimited number of whitelist items.
    Follow the template below to add whitelist items.

    (?!mailto:)(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,}))((?<=(?<![-a-z\u00a1-\uffff0-9])example\.com)|(?<=(?<![-a-z\u00a1-\uffff0-9])example1\.com)|(?<=(?<![-a-z\u00a1-\uffff0-9])example2\.com))?))|localhost)(?::\d{2,5})?(?:\/[^\s]*)?
    

    https://regex101.com/r/rCBd0P/1

    Python Code Sample:

    import re
     
    def ConvertURL_func(input_text):
      # Replacement callback: group 1 is None unless a whitelist item matched.
      def repl(m):
        if m.group(1) is None: return "{Removed}"
        return m.group(0)
      #
      input_text = re.sub(r"(?!mailto:)(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,}))((?<=(?<![-a-z\u00a1-\uffff0-9])example\.com)|(?<=(?<![-a-z\u00a1-\uffff0-9])example1\.com)|(?<=(?<![-a-z\u00a1-\uffff0-9])example2\.com))?))|localhost)(?::\d{2,5})?(?:\/[^\s]*)?",repl,input_text)
      return input_text
    
    # input URL strings example:
    input_text = '''
    bad.com
    www.example.com
    example.com
    www.badurl.com www.badurlexample2.com
    www.badurl.com example1.com
    https://www.example2.com
    '''
    
    input_text = ConvertURL_func(input_text)
    print(input_text)
    

    Output:

    >>> print(input_text)
    
    {Removed}
    www.example.com
    example.com
    {Removed} {Removed}
    {Removed} example1.com
    https://www.example2.com
    
    >>>
    

    Regex expanded:

     (?! mailto: )
     (?:
        (?: https? | ftp )
        :\/\/
     )?
     (?:
        \S+ 
        (?: : \S* )?
        @
     )?
     (?:
        (?:
           (?:
              [1-9] \d? 
            | 1 \d\d 
            | 2 [01] \d 
            | 22 [0-3] 
           )
           (?:
              \.
              (?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
           ){2}
           (?:
              \.
              (?:
                 [1-9] \d? 
               | 1 \d\d 
               | 2 [0-4] \d 
               | 25 [0-4] 
              )
           )
         | 
           (?:
              (?:
                 (?: [a-z\u00a1-\uffff0-9]+ -? )*
                 [a-z\u00a1-\uffff0-9]+ 
              )
              (?:
                 \.
                 (?: [a-z\u00a1-\uffff0-9]+ -? )*
                 [a-z\u00a1-\uffff0-9]+ 
              )*
              (?:
                 \.
                 (?: [a-z\u00a1-\uffff]{2,} )
              )
              (                           # (1 start)
                 # Start Whitelist
                 
                 (?<=
                    (?<! [-a-z\u00a1-\uffff0-9] )
                    example\.com
                 )
               | (?<=
                    (?<! [-a-z\u00a1-\uffff0-9] )
                    example1\.com
                 )
               | (?<=
                    (?<! [-a-z\u00a1-\uffff0-9] )
                    example2\.com
                 )
                 
                 # Add more whitelist items
              )?                          # (1 end)
           )
        )
      | localhost
     )
     (?: : \d{2,5} )?
     (?: \/ [^\s]* )?
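
Since every whitelist item in the expanded regex follows the same template, the group could also be generated from a list rather than edited by hand. The following is a minimal sketch under that assumption; the helper name `whitelist_group` is hypothetical, and `re.escape` is used so that the dots in each domain are matched literally.

```python
import re

# Left boundary used in front of every whitelist item in the answer's pattern.
BOUNDARY = r"(?<![-a-z\u00a1-\uffff0-9])"

def whitelist_group(domains):
    """Build the optional capture group (group 1) from a list of domains."""
    checks = ["(?<=" + BOUNDARY + re.escape(d) + ")" for d in domains]
    return "(" + "|".join(checks) + ")?"

print(whitelist_group(["example.com", "example1.com", "example2.com"]))
```

The returned string would take the place of the hand-written group 1 in the full pattern. Each lookbehind has its own fixed width, so items of different lengths can be mixed freely.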