Tags: python, regex, python-3.x, newline, pyperclip

Regex problem searching through multiline text copied with pyperclip


Something odd happens when I search with a regex through the result of pyperclip.paste() and the search expression involves a \n newline character.

Excuse my English.

When I run the search on text assigned to a variable with a triple-quoted string:

import re

text = '''
This as the line 1
This as the line 2
'''

pattern = re.compile(r'\d\n\w+')
result = pattern.findall(text)
print(result)

It actually prints the newline character \n, which is what I want, or at least what I expect.

»»» ['1\nThis']

But the problem starts when the string to search comes from text copied from the clipboard.

This as the line 1
This as the line 2

Say I select that text, copy it to the clipboard, and want the regex to extract the same output as before. This time I need to use the pyperclip module.

So, setting the previous code aside, I write this instead:

import re, pyperclip

text = pyperclip.paste()

pattern = re.compile(r'\d\n\w+')
result = pattern.findall(text)
print(result)

This is the result:

»»» [ ]

Nothing but two brackets. I discovered (in my inexperience) that the \n character is what causes this. And it has nothing to do with a conflict with Python's own \n escape, because we avoid that with the 'r' prefix.

I already found a solution, though not a very clear one (at least to me, because I only know the basics of Python right now).

import re, pyperclip

text = pyperclip.paste()
lines = text.split('\n')
spam = ''

for i in lines:
    spam = spam + i

pattern = re.compile(r'\d\r\w+')
result = pattern.findall(spam)
print(result)

Note that instead of \n to detect new lines in the last regex, I opted for \r (\n would cause the same bad behavior, printing only brackets). \r is interchangeable with \s here and the search works, but the output is:

»»» ['1\rThis']

With \r instead of \n.

At least it was a little victory for me.

It would help me a lot if you could explain a better solution for this, or at least help me understand why this happened. You can also recommend some concepts to look into for a full understanding of this.


Solution

  • You are getting the \r when pasting because you are pasting on a Windows machine. On Windows, a newline is represented by \r\n. Note that \s is different from \r: \s matches any whitespace character, while \r matches only the carriage return character.

    The text:

    This as the line 1
    This as the line 2

    actually looks like:

    This as the line 1\r\nThis as the line 2\r\n

    on a Windows machine.

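    In the regex, \d\r matches the end of the first line, 1\r, but then \w+ cannot match the \n that follows. Your original \d\n\w+ fails for the mirror-image reason: the digit is followed by \r, not \n. You need to edit your first regex to be: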

    pattern = re.compile(r'\d\r\n\w+')

    Source: Do line endings differ between Windows and Linux?
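
If the same script also has to handle text with plain \n line endings, a pattern that hard-codes \r\n will stop matching there. The sketch below shows two possible ways around that. It is an illustration only: the `pasted` string is an assumed stand-in for what pyperclip.paste() returned in the question, so the example runs without a clipboard.

```python
import re

# Assumed stand-in for pyperclip.paste() on Windows; in the real script
# this would be: pasted = pyperclip.paste()
pasted = 'This as the line 1\r\nThis as the line 2\r\n'

# repr() makes the otherwise invisible \r characters visible.
print(repr(pasted))

# Option 1: accept an optional \r before the \n in the pattern itself.
tolerant = re.compile(r'\d\r?\n\w+')
print(tolerant.findall(pasted))       # ['1\r\nThis']

# Option 2: normalize Windows line endings first, then reuse the
# original pattern from the question unchanged.
normalized = pasted.replace('\r\n', '\n')
original = re.compile(r'\d\n\w+')
print(original.findall(normalized))   # ['1\nThis']
```

Option 2 gives exactly the output from the triple-quoted example in the question. Option 1 leaves the text untouched, so the \r stays in the match.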