
Turning 'bytes' into 'str': Why is a '\' added to '\n' and such?


I read various lines from a CSV file like this:

f1 = open(current_csv, 'rb')
table = f1.readlines()
f1.close()

So essentially any single line in table is something like this:

line = b' G\xe4rmanword:           123,45\r\n'

which type() tells me is bytes, but I need to work on it with .replace, so I convert it to a string with line = str(line). Now line has turned into

"b' G\\xe4rmanword:           123,45\\r\\n'"

with an added \ before every \. With print(line) they don't show up, but if I want to turn \xe4 into ae (an alternative way of writing ä) with line = line.replace('\xe4', 'ae'), this does nothing. Using '\\xe4' works, however. I would have expected the first call to turn \\xe4 into \ae rather than do nothing, and the second option, while it works, relies on my writing a new replacement rule for ä. I'd rather avoid both.

So I'm trying to understand where the extra backslash comes from and how I can avoid it in the first place, instead of having to fix it in postprocessing. I have the feeling that something changed between Python 2 and Python 3, since the original CSV reader is a Python 2 script that I translated with 2to3.
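The behavior described above can be reproduced without the CSV file. This is a minimal sketch using the sample line from the question; the variable names as_text, unchanged and changed are only illustrative.

```python
line = b' G\xe4rmanword:           123,45\r\n'

# str() without an encoding does not decode the bytes; it returns
# the same text as repr(line), including the b'...' wrapper.
as_text = str(line)
print(as_text == repr(line))

# '\xe4' is the single character 'ä', which does not occur in as_text.
# as_text holds four separate characters instead: backslash, x, e, 4.
unchanged = as_text.replace('\xe4', 'ae')
print(unchanged == as_text)

# '\\xe4' is those four characters, so this replacement matches.
changed = as_text.replace('\\xe4', 'ae')
print(changed)
```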


Solution

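  • Yes. Since Python 3 uses Unicode for all strings, the semantics of many string-related functions, including str, have changed compared to Python 2. Called on a bytes object without an encoding, str does not decode anything: it returns the printable representation of the object (the same text as repr(line)), so the result contains the b'...' wrapper and a literal backslash for each escape. The doubled backslashes are only how the interpreter displays those literal backslashes. To decode instead, pass str a second argument giving the encoding of your input bytes (which, judging from the German text, is probably 'latin1'):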

    unicode_string = str(line, 'latin1')
    

    Alternatively you can do the same using

    unicode_string = line.decode('latin1')
    

    You probably also want the trailing \r\n removed, so add .rstrip() to that. Besides, a cleaner way to read the file is:

    with open(current_csv, 'rb') as f1:
        table = f1.readlines()
    

    (the with block closes the file for you, so there is no need for close())
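Putting the pieces of this answer together, a minimal sketch might look like the following. It assumes the data really is latin1-encoded, as guessed above, and uses the sample line from the question instead of reading the file; the names text and cleaned are illustrative.

```python
line = b' G\xe4rmanword:           123,45\r\n'

# Decode with the assumed encoding; latin1 maps the byte 0xE4 to 'ä'.
# str(line, 'latin1') would give the same result.
text = line.decode('latin1')

# Remove the trailing '\r\n', then replace the real 'ä' character.
# No doubled backslash is needed, because text contains 'ä' itself.
cleaned = text.rstrip().replace('\xe4', 'ae')
print(repr(cleaned))
```

If every line needs decoding, opening the file in text mode with open(current_csv, encoding='latin1') should also work, since it decodes while reading and yields str lines directly.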