
Python re.sub to remove single quotes AND double quotes from string


Here's a question that is quickly driving me mad. I want to remove both ' and " characters from a string. I want to use re.sub to do it (because I am trying to compare re.sub vs str.replace, so I want to do it both ways). Now my understanding of raw strings is that escape characters are treated as literals UNLESS they are escaping the character that opened the string. So I have two ideas for how to do this:

# Method 1: concatenate strings that have different enclosing characters
>>> REGEX1 = re.compile(r"[" + r'"' + r"'" + r"]")
>>> REGEX1.pattern
'["\']'
# Method 2: Try to escape one of the quotation characters
>>> REGEX2 = re.compile(r"[\"']")
>>> REGEX2.pattern
'[\\"\']'

The patterns given LOOK different. Are they though? I test whether they behave the same in the regex:

>>> test_string = "hello ' world \" "
>>> test_string
'hello \' world " '
>>> result_1 = REGEX1.sub(r'', test_string)
>>> result_2 = REGEX2.sub(r'', test_string)
>>> result_1
'hello  world  '
>>> result_2
'hello  world  '
>>> 

My intuition tells me one of two things is possible:

  1. '["']' == '[\"']'
  2. '["']' != '[\"']', but will behave equivalently when treated as a regular expression.

One last test then:

>>> '["\']' == '[\\"\']'
False

So is 2) above the correct statement? Can you help me understand what's going on?


Solution

  • The two pattern strings are different, as displaying their values shows, but interpreted as regular expressions they are equivalent:

    import re
    
    
    REGEX1 = re.compile(r"[" + r'"' + r"'" + r"]")
    print(REGEX1.pattern)
    print(REGEX1.sub('', """abc"'def"""))
    REGEX2 = re.compile(r"[\"']")
    print(REGEX2.pattern)
    print(REGEX2.sub('', """abc"'def"""))
    

    Prints:

    ["']
    abcdef
    [\"']
    abcdef 
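
Since the question mentions comparing re.sub with str.replace, here is a minimal sketch of the non-regex counterparts applied to the test string from the question. The variable names are only illustrative, and str.translate is an extra alternative that the question did not ask about.

```python
import re

text = "hello ' world \" "

# Regex approach from the question.
via_regex = re.sub(r"[\"']", "", text)

# Chained str.replace calls, one per quote character.
via_replace = text.replace("'", "").replace('"', "")

# str.translate with a deletion table built by str.maketrans.
via_translate = text.translate(str.maketrans("", "", "'\""))

print(via_regex == via_replace == via_translate)  # True
print(repr(via_regex))                            # 'hello  world  '
```

All three produce the same string, so any timing comparison between them is comparing like with like.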
    

    Explanation

    The difference between the raw string r'\n' and the non-raw string '\n' is huge: the latter is an escape sequence that produces the newline character, whereas the former is equivalent to '\\n', i.e. the two-character sequence of a backslash followed by the letter n. The raw string r"[\"']" likewise keeps its backslash, so the pattern really is the five characters [\"']. The regular expression engine, however, treats a backslash followed by a non-alphanumeric character such as " as that literal character, so the backslash is superfluous there and ["'] and [\"'] are equivalent as regular expressions.
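
A minimal sketch of the point above, checking the two pattern strings character by character. The names p1, p2, p3 and sample are made up for illustration.

```python
import re

# Pattern strings as Python sees them.
p1 = r"[" + r'"' + r"'" + r"]"   # 4 characters: [ " ' ]
p2 = r"[\"']"                    # 5 characters: [ \ " ' ]

print(len(p1), len(p2))          # 4 5
print(p1 == p2)                  # False: the strings differ by one backslash

# In a non-raw literal, \" is a recognized escape that yields just the quote,
# so this literal is identical to p1.
p3 = "[\"']"
print(p3 == p1)                  # True

# The regex engine reads \" as an escaped literal double quote, so both
# patterns remove exactly the same characters.
sample = "it's a \"test\""
print(re.sub(p1, "", sample) == re.sub(p2, "", sample))  # True
```

So statement 2) in the question is the correct one: the strings are unequal, but the compiled patterns behave the same.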

    Update

    I made the point that there is in general a big difference between raw and non-raw strings when what follows the backslash has special meaning (e.g. r'\n' vs. '\n'). With regular expressions, however, that difference does not always change what gets matched. For example, the Python regular expression engine will match a newline character with a pattern compiled from either the two-character sequence r'\n' (that is, '\\n') or the newline character '\n' itself:

    import re
    
    
    REGEX1 = re.compile('a\nb') # use actual newline
    print('pattern1 = ', REGEX1.pattern)
    print(REGEX1.search('a\nb'))
    REGEX2 = re.compile(r'a\nb') # use '\\n'
    print('pattern 2 =', REGEX2.pattern)
    print(REGEX2.search('a\nb'))
    

    Prints:

    pattern1 =  a
    b
    <re.Match object; span=(0, 3), match='a\nb'>
    pattern 2 = a\nb
    <re.Match object; span=(0, 3), match='a\nb'>
    

    But raw strings are generally used because of situations where you need, for example, r'\1' to refer back to capture group 1; the non-raw '\1' is the single character '\x01', so the pattern would match that character instead.
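
The backreference case can be sketched as follows. The pattern and sample text are invented for illustration, and the non-raw pattern writes \\w explicitly so that only the \1 escape differs between the two.

```python
import re

text = "hello hello world"

# Raw string: the regex engine receives backslash + "1", a backreference to group 1.
doubled = re.compile(r"(\w+) \1")
print(doubled.search(text))                 # matches 'hello hello'

# Non-raw string: Python turns '\1' into the single character chr(1) first.
not_a_backref = re.compile("(\\w+) \1")
print("\1" == "\x01")                       # True
print(not_a_backref.search(text))           # None
print(not_a_backref.search("hello \x01"))   # matches 'hello \x01'
```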