Here's a question that is quickly driving me mad. I want to remove both ' and " characters from a string. I want to use re.sub to do it (because I am trying to compare re.sub vs str.replace, so I want to do it both ways). Now my understanding of raw strings is that escape characters are treated as literals UNLESS they are escaping the character that opened the string. So I have two ideas for how to do this:
# Method 1: concatenate strings that have different enclosing characters
>>> REGEX1 = re.compile(r"[" + r'"' + r"'" + r"]")
>>> REGEX1.pattern
'["\']'
# Method 2: Try to escape one of the quotation characters
>>> REGEX2 = re.compile(r"[\"']")
>>> REGEX2.pattern
'[\\"\']'
The patterns given LOOK different. Are they, though? I tested whether they behave the same in the regex:
>>> test_string = "hello ' world \" "
>>> test_string
'hello \' world " '
>>> result_1 = REGEX1.sub(r'', test_string)
>>> result_2 = REGEX2.sub(r'', test_string)
>>> result_1
'hello  world  '
>>> result_2
'hello  world  '
>>>
My intuition tells me one of two things is possible:
One last test then:
>>> '["\']' == '[\\"\']'
False
So is 2) above the correct statement? Can you help me understand what's going on?
They look different as demonstrated when you display their values, but as far as being interpreted as regular expressions, they are equivalent:
import re
REGEX1 = re.compile(r"[" + r'"' + r"'" + r"]")
print(REGEX1.pattern)
print(REGEX1.sub('', """abc"'def"""))
REGEX2 = re.compile(r"[\"']")
print(REGEX2.pattern)
print(REGEX2.sub('', """abc"'def"""))
Prints:
["']
abcdef
[\"']
abcdef
Explanation
The difference between the raw string r'\n' and the non-raw string '\n' is huge: the latter is a special escape sequence that equates to the newline character, whereas the former is equivalent to '\\n', i.e. the two-character sequence of a backslash followed by the letter n. In r"[\"']" the backslash is likewise kept in the string, which is why the pattern displays as [\"']. But the regular expression engine gives no special meaning to a backslash followed by a double quote, so it treats \" as a plain ", the backslash is superfluous, and thus ["'] and [\"'] are equivalent.
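To make the distinction concrete, here is a small sketch that compares the two pattern strings directly. The variable names (raw_concat, raw_escaped, sample) are just illustrative and not from the question:

```python
import re

raw_concat = r"[" + r'"' + r"'" + r"]"   # 4 characters: [ " ' ]
raw_escaped = r"[\"']"                    # 5 characters: [ \ " ' ]

# The strings themselves differ: the raw literal keeps the backslash.
print(len(raw_concat), len(raw_escaped))  # 4 5
print(raw_concat == raw_escaped)          # False

# As regular expressions they remove exactly the same characters.
sample = "hello ' world \" "
cleaned_concat = re.sub(raw_concat, "", sample)
cleaned_escaped = re.sub(raw_escaped, "", sample)
print(cleaned_concat == cleaned_escaped)  # True

# In a non-raw literal, Python itself consumes the backslash,
# so this string is identical to the concatenated one.
print("[\"']" == raw_concat)              # True
```

So the comparison in the question returns False because it compares the strings, not the behaviour of the compiled patterns.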
Update
I made the point above that there is in general a big difference between raw and non-raw strings when the character following the backslash gives the sequence a special meaning (e.g. r'\n' vs. '\n'). With regular expressions, however, the difference does not always matter in practice. For example, the Python regular expression engine will match a newline character with a pattern compiled from either the two-character sequence r'\n' (that is, '\\n') or the newline character '\n':
import re
REGEX1 = re.compile('a\nb') # use actual newline
print('pattern1 = ', REGEX1.pattern)
print(REGEX1.search('a\nb'))
REGEX2 = re.compile(r'a\nb') # use '\\n'
print('pattern 2 =', REGEX2.pattern)
print(REGEX2.search('a\nb'))
Prints:
pattern1 =  a
b
<re.Match object; span=(0, 3), match='a\nb'>
pattern 2 = a\nb
<re.Match object; span=(0, 3), match='a\nb'>
But raw strings are generally used because of situations where you might need, for example, r'\1' to refer back to capture group 1, whereas '\1' would have matched the character '\x01'.
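A quick sketch of that last point, using a made-up pattern that looks for a doubled character (the names raw_pattern and plain_pattern are only for illustration):

```python
import re

raw_pattern = re.compile(r'(.)\1')   # \1 reaches the regex engine: backreference to group 1
plain_pattern = re.compile('(.)\1')  # Python turns '\1' into the character '\x01' first

# The raw version finds the repeated letter.
print(raw_pattern.search('hello').group())        # ll

# The non-raw version looks for "any character followed by \x01".
print(plain_pattern.search('hello'))              # None
print(plain_pattern.search('a\x01') is not None)  # True
```

The non-raw pattern compiles without complaint, which is what makes this kind of mistake easy to miss.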