
Understanding raw strings for regular expressions in Python


I have lots of text files full of newlines which I am parsing in Python 3.4. I am looking for the newlines because they separate my text into different parts. Here is an example of a text:

text = 'avocat  ;\n\n       m. x'

I naïvely started looking for newlines with '\n' in my regular expression (RE) without thinking about the fact that the backslash '\' is an escape character. However, this turned out to work fine:

>>> import re

>>> pattern1 = '\n\n'
>>> re.findall(pattern1, text)
['\n\n']

Then I understood that I should be using a double backslash in order to look for one backslash. This also worked fine:

>>> pattern2 = '\\n\\n'
>>> re.findall(pattern2, text)
['\n\n']

On another thread, however, I was told to use raw strings instead of regular strings, and that form fails to find the newlines I am looking for:

>>> pattern3 = r'\\n\\n'
>>> pattern3
'\\\\n\\\\n'
>>> re.findall(pattern3, text)
[]

Could you please help me out here? I am getting a little confused about what kind of RE I should be using in order to correctly match the newlines.


Solution

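  • Don't double the backslash when using a raw string. A raw string already keeps each backslash as written, so the regex engine receives the escape \n and matches a newline: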

    >>> pattern3 = r'\n\n'
    >>> pattern3
    '\\n\\n'
    >>> re.findall(pattern3, text)
    ['\n\n']
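
To see why the variants behave differently, it can help to look at the characters each pattern actually hands to the regex engine. The following is a minimal sketch that reuses `text` from the question; the names `pattern3_fixed` and `pattern3_broken` are illustrative only and do not come from the original post.

```python
import re

text = 'avocat  ;\n\n       m. x'

# Regular string: Python itself turns each \n into a newline character,
# so the regex engine receives two literal newline characters.
pattern1 = '\n\n'

# Regular string with doubled backslashes: Python collapses each \\ into one
# backslash, so the engine receives the four characters \ n \ n and
# interprets the regex escape \n as a newline on its own.
pattern2 = '\\n\\n'

# Raw string with single backslashes: the same four characters as pattern2.
pattern3_fixed = r'\n\n'

# Raw string with doubled backslashes: the engine receives \\n\\n, which is
# the regex for a literal backslash followed by the letter n, twice.
pattern3_broken = r'\\n\\n'

for name, pattern in [('pattern1', pattern1),
                      ('pattern2', pattern2),
                      ('pattern3_fixed', pattern3_fixed),
                      ('pattern3_broken', pattern3_broken)]:
    print(name, len(pattern), re.findall(pattern, text))

# pattern3_broken only matches text that really contains backslashes:
print(re.findall(pattern3_broken, r'\n\n'))
```

In this sketch `pattern2` and `pattern3_fixed` are the same string, which is why both find the newlines, while `pattern3_broken` only matches input containing a literal backslash followed by `n`.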