I read various lines from a CSV file like this:
f1 = open(current_csv, 'rb')
table = f1.readlines()
f1.close()
So essentially any single line in table is something like this:
line = b' G\xe4rmanword: 123,45\r\n'
which type tells me is bytes, but I need to work on it with .replace, so I'm turning it into a string with line = str(line). But now line has turned into
"b' G\\xe4rmanword: 123,45\\r\\n'"
with an added \ before every \. However, with print(line) they don't show up. If I want to turn \xe4 into ae (an alternative way of writing ä) with line = line.replace('\xe4', 'ae'), this just does nothing. Using '\\xe4' works, however. But I would have expected the first one to turn \\xe4 into \ae instead of doing nothing, and the second option, while working, relies on my defining a new replacement rule for ä. I'd rather avoid both.
So I'm trying to understand where the extra backslash comes from and how I can avoid it to start with, instead of having to fix it in my postprocessing. I have the feeling that something changed between Python 2 and 3, since the original CSV reader is a Python 2 script I translated with 2to3.
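Yes. Since Python 3 uses Unicode for all strings, the semantics of many string-related functions, including str, have changed compared to Python 2. Called with a single bytes argument, str returns the printable representation of that object (the same text repr gives), so the result literally contains the characters b, ' and a backslash followed by xe4. The doubled backslash is only how the interpreter displays that one literal backslash, which is why print(line) shows a single one. In this particular case, you need to pass a second argument to str giving the encoding used in your input bytes value (which, judging from the use of German, is most likely 'latin1'):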
unicode_string = str(line, 'latin1')
Alternatively you can do the same using
unicode_string = line.decode('latin1')
You'd probably also want the trailing \r\n removed, so add .rstrip() to that.
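To make the difference concrete, here is a small sketch using the sample line from the question. The variable names as_repr, decoded and cleaned are only illustrative, and 'latin1' is still an assumption about how the file was written.

```python
# Sample line copied from the question.
line = b' G\xe4rmanword: 123,45\r\n'

# Without an encoding, str() returns the repr of the bytes object:
# the text contains a literal backslash followed by the characters "xe4".
as_repr = str(line)
print(as_repr)

# With an encoding (assumed to be latin1 here), the byte 0xE4 is decoded
# into the single character "ä"; rstrip() drops the trailing \r\n.
decoded = line.decode('latin1').rstrip()
print(decoded)

# In a str literal, '\xe4' is the character "ä", so replace() now matches.
cleaned = decoded.replace('\xe4', 'ae')
print(cleaned)
```

Note that .rstrip() without arguments also removes any other trailing whitespace; .rstrip('\r\n') would strip only the line ending.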
Besides, a more elegant solution for reading the file is:
with open(current_csv, 'rb') as f1:
    table = f1.readlines()
(so there is no need for close())
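If you don't need the raw bytes at all, another option (not part of the original script, so treat it as a sketch) is to open the file in text mode with an explicit encoding, which gives you str lines directly. The temporary file below is a hypothetical stand-in for current_csv so the example runs on its own, and 'latin1' is again an assumption.

```python
import os
import tempfile

# Hypothetical sample file standing in for current_csv.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b' G\xe4rmanword: 123,45\r\n')
    current_csv = tmp.name

# Text mode with an explicit encoding yields str lines, so no separate
# decode step is needed before calling replace().
with open(current_csv, encoding='latin1') as f1:
    table = [row.rstrip().replace('\xe4', 'ae') for row in f1]

os.remove(current_csv)  # clean up the sample file
print(table)
```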