The Problem:
I have this artificial example function:
def test_function(target, words):
pattern = re.compile(r"|".join(words))
return bool(pattern.search(target))
which takes a list of words and dynamically constructs a regular expression pattern without proper escaping the words in the list.
Usage samples:
text = "hello world!"
print(test_function(text, ["test"])) # prints False
print(test_function(text, ["hello"])) # prints True
print(test_function(text, ["test", "world"])) # prints True
The Question:
How can I test this function to prove that there is no proper regular expression escaping or input sanitization?
In other words, what items in a words
list should I provide to "break" this function?
I've tried several "evil" regexes to simulate catastrophic backtracking and force the function to hang like (x+x+)+y
or (a+)+
, but the function just returns False
instantly and there is no indication of a problem.
There are plenty of ways you could do this. For example, a word that isn't a valid regex:
>>> test_function('a', ['*'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 2, in test_function
File "/usr/lib64/python2.6/re.py", line 190, in compile
return _compile(pattern, flags)
File "/usr/lib64/python2.6/re.py", line 245, in _compile
raise error, v # invalid expression
sre_constants.error: nothing to repeat
or a word that matches everything as a regex:
>>> test_function('a', ['.*'])
True
or a word that doesn't match what it should as a regex:
>>> test_function('$^', ['$^'])
False
or a word that ends in a backslash and escapes the |
:
>>> test_function('a', ['\\', 'a'])
False
Catastrophic backtracking works too:
>>> test_function('a'*100, ['(a+)+b'])
# Hangs.