Search code examples
pythonregexsanitization

Proving a function has no proper input validation


The Problem:

I have this artificial example function:

def test_function(target, words):
    pattern = re.compile(r"|".join(words))

    return bool(pattern.search(target))

which takes a list of words and dynamically constructs a regular expression pattern without proper escaping the words in the list.

Usage samples:

text = "hello world!"

print(test_function(text, ["test"]))  # prints False
print(test_function(text, ["hello"]))  # prints True
print(test_function(text, ["test", "world"]))  # prints True

The Question:

How can I test this function to prove that there is no proper regular expression escaping or input sanitization?

In other words, what items in a words list should I provide to "break" this function?


I've tried several "evil" regexes to simulate catastrophic backtracking and force the function to hang like (x+x+)+y or (a+)+, but the function just returns False instantly and there is no indication of a problem.


Solution

  • There are plenty of ways you could do this. For example, a word that isn't a valid regex:

    >>> test_function('a', ['*'])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 2, in test_function
      File "/usr/lib64/python2.6/re.py", line 190, in compile
        return _compile(pattern, flags)
      File "/usr/lib64/python2.6/re.py", line 245, in _compile
        raise error, v # invalid expression
    sre_constants.error: nothing to repeat
    

    or a word that matches everything as a regex:

    >>> test_function('a', ['.*'])
    True
    

    or a word that doesn't match what it should as a regex:

    >>> test_function('$^', ['$^'])
    False
    

    or a word that ends in a backslash and escapes the |:

    >>> test_function('a', ['\\', 'a'])
    False
    

    Catastrophic backtracking works too:

    >>> test_function('a'*100, ['(a+)+b'])
    # Hangs.