python regex escaping doctest rawstring

Passing string with (accidental) escape character loses character even though it's a raw string


I have a function with a Python doctest that fails because one of the test input strings contains a backslash that is treated as an escape character, even though I've written the string as a raw string.

My doctest looks like this:

>>> infile = [ "Todo:        fix me", "/** todo: fix", "* me", "*/", r"""//\todo      stuff to fix""", "TODO fix me too", "toDo bug 4663" ]
>>> find_todos( infile )
['fix me', 'fix', 'stuff to fix', 'fix me too', 'bug 4663']

And the function, which is intended to extract the todo text from each line that contains some variation of a todo marker, looks like this:

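todos = list()
for line in infile:
    print line  # debug output; doctest compares stdout, so remove this before running the test
    match = todo_match_obj.search( line )  # search once and reuse the match object
    if match:
        todos.append( match.group( 'todo' ) )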

And the regular expression called todo_match_obj is:

r"""(?:/{0,2}\**\s?todo):?\s*(?P<todo>.+)"""
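
For reference, the pieces above can be assembled into a self-contained sketch. The `re.IGNORECASE` flag is an assumption (the compile call is not shown here, but the mixed-case inputs suggest it), the `def` line and `return` are filled in from the doctest, and the debug print is left out:

```python
import re

# Assumption: the pattern is compiled case-insensitively; the original
# compile call is not shown in the question.
todo_match_obj = re.compile(
    r"""(?:/{0,2}\**\s?todo):?\s*(?P<todo>.+)""", re.IGNORECASE )

def find_todos( infile ):
    """Return the todo text of every line that contains a todo marker."""
    todos = []
    for line in infile:
        match = todo_match_obj.search( line )
        if match:
            todos.append( match.group( 'todo' ) )
    return todos

infile = [ "Todo:        fix me", "/** todo: fix", "* me", "*/",
           r"""//\todo      stuff to fix""", "TODO fix me too", "toDo bug 4663" ]
result = find_todos( infile )
```

Called directly like this, with a genuine raw string, the sketch returns the five-element list expected by the doctest. Under these assumptions, that suggests the backslash is lost before the string ever reaches the function.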

A quick conversation with my ipython shell gives me:

In [35]: print "//\todo"
//      odo

In [36]: print r"""//\todo"""
//\todo

And, just in case the doctest implementation uses stdout (I haven't checked, sorry):

In [37]: sys.stdout.write( r"""//\todo""" )
//\todo
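
One thing these checks do not cover, offered as an assumption because the enclosing docstring is not shown: doctest reads its examples from the docstring's text, so if the docstring itself is not a raw string, Python turns `\t` into a tab before doctest ever parses the example. A minimal sketch of the effect (the two functions and the `count_failures` helper are hypothetical names used only for illustration):

```python
import doctest

def plain_docstring():
    """
    >>> 'todo' in r'//\todo'
    True
    """

def raw_docstring():
    r"""
    >>> 'todo' in r'//\todo'
    True
    """

def count_failures( func ):
    """Run the examples in func's docstring; return how many failed."""
    test = doctest.DocTestParser().get_doctest(
        func.__doc__, {}, func.__name__, None, 0 )
    runner = doctest.DocTestRunner( verbose=False )
    # Discard the failure report; only the count matters here.
    return runner.run( test, out=lambda text: None ).failed

plain_failures = count_failures( plain_docstring )  # the \t was already a tab
raw_failures = count_failures( raw_docstring )      # the backslash survived
```

If this is what is happening, prefixing the docstring with `r` (or doubling the backslash inside it) would keep the example intact.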

My regex-fu is not strong by any standard, and I realize that I could be missing something here.

EDIT: Following Alex Martelli's answer, I would like suggestions on what regular expression would actually match the blasted r"""//\todo fix me""". I know that I did not originally ask for someone to do my homework, and I will accept Alex's answer as it really did answer my question (or confirm my fears). But I promise to upvote any good solutions to my problem here :)

EDITEDIT: for reference, a bug has been filed with the kodos project: bug #437633

I'm using Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15)

Thank you for reading this far (if you skipped directly down here, I understand).


Solution

  • Read your original regex carefully:

    r"""(?:/{0,2}\**\s?todo):?\s*(?P<todo>.+)"""
    

    It matches: zero to two slashes, then 0+ stars, then 0 or 1 "whitespace characters" (blanks, tabs etc), then the literal characters 'todo' (and so on).

    Your raw string is:

    r"""//\todo      stuff to fix"""
    

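    so there's a literal backslash between the slashes and the 'todo', and therefore the regex can't match that prefix: nowhere in that regex are you expressing any desire to optionally match a literal backslash. (An anchored re.match fails outright; an unanchored re.search can only succeed by starting at the 't', skipping the slashes and the backslash.)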

    Edit: An RE pattern, very close to yours, that would accept and ignore an optional backslash right before the 't' would be:

    r"""(?:/{0,2}\**\s?\\?todo):?\s*(?P<todo>.+)"""
    

    Note that the backslash does have to be repeated, to "escape itself", in this case.
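
The difference between the two patterns can be checked with a short sketch. The names `original` and `relaxed` are illustrative, and the `re.IGNORECASE` flag is an assumption, since the compile call is not shown in the question:

```python
import re

line = r"""//\todo      stuff to fix"""

# Assumption: both patterns are compiled case-insensitively.
original = re.compile(
    r"""(?:/{0,2}\**\s?todo):?\s*(?P<todo>.+)""", re.IGNORECASE )
relaxed = re.compile(
    r"""(?:/{0,2}\**\s?\\?todo):?\s*(?P<todo>.+)""", re.IGNORECASE )

# Anchored at the start of the line, only the relaxed pattern matches,
# because it alone allows a backslash between the slashes and 'todo'.
anchored_original = original.match( line )
anchored_relaxed = relaxed.match( line )

# Unanchored, the original still finds a match, but only by starting at
# the 't' and leaving the slashes and the backslash outside the match.
unanchored_original = original.search( line )
```

Here `anchored_original` is `None`, `anchored_relaxed` starts at index 0, and `unanchored_original` starts at index 3. Both successful matches capture the same `todo` group, so which pattern to prefer depends mainly on whether the code uses `match` or `search`.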