I want to substitute a substring with a hash. The substring contains non-ASCII characters, so I tried to encode it to UTF-8.
result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)', lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4).encode()).hexdigest(), line.encode('utf-8'))
I am not really sure why this doesn't work; I thought that with line.encode('utf-8') the whole string gets encoded. I also tried to encode my m.group() results to UTF-8, but I got the same UnicodeDecodeError.
[unicodedecodeerror: 'ascii' codec can't decode byte in position ordinal not in range(128)]
Sample input:
Start: myUsername: myÜsername:
What am I missing?
EDIT:
Traceback (most recent call last):
  File "C:/Users/Peter/Desktop/coding/filter.py", line 26, in <module>
    encodeline = line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 112: ordinal not in range(128)
Based on your symptoms, you're running on Python 2. Calling encode on a Python 2 str is almost always nonsensical.
You have two problems; one you're hitting now, and one you'll hit if you fix your current code.
Your first problem is that line is already a str holding (apparently) UTF-8 encoded bytes, not a unicode, so encoding it implicitly decodes it with Python's default encoding (ASCII; this isn't locale specific to my knowledge, and it's a rare Python 2 install that uses anything else), then re-encodes with the specified codec (or the default if not specified). Basically, line was already UTF-8 encoded and you told it to encode again as UTF-8, but that's nonsensical, so Python tried to decode as ASCII first, and failed before it even tried to encode as you instructed.
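To make the implicit step visible, here is a minimal sketch that spells out the two stages by hand. The helper name py2_style_encode is hypothetical (it is not a real Python API); it only imitates what Python 2 appears to do when encode is called on a str, and the byte string is the sample input from the question written out as UTF-8 bytes.

```python
def py2_style_encode(raw_bytes, codec='utf-8', default='ascii'):
    """Hypothetical helper imitating Python 2's str.encode():
    an implicit decode with the default codec, then the requested encode."""
    text = raw_bytes.decode(default)  # this is the step that fails on non-ASCII bytes
    return text.encode(codec)         # never reached for the sample input


# Sample input from the question as UTF-8 bytes; \xc3\x9c is U+00DC.
line = b'Start: myUsername: my\xc3\x9csername:'

try:
    py2_style_encode(line)
    error = None
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; Python 3 clears `exc` after the block

print(error)
```

The exception comes from the decode stage, which is why an encode call ends up reporting a UnicodeDecodeError.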
The solution to this problem is to just not encode line at all; it's already UTF-8 encoded, so you're already golden.
Your second problem (which you haven't encountered yet, but you will) is that you're calling encode on the group(4) result. But of course, since the input was a str, the group is a str too, and you'll encounter the same problem trying to encode a str; since the group came from raw UTF-8 encoded bytes, the non-ASCII parts of it cause a UnicodeDecodeError during the implicit decode step before the encode.
The reason:
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
works is that it (dangerously) changes the implicit decode step to use UTF-8, so all your encode calls now perform the implicit decode with UTF-8 instead of ASCII; the decode and encode is mostly pointless, since all it does is return the original str after confirming it's legal UTF-8 by means of decoding it as such, and otherwise acting as an expensive no-op.
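As a rough illustration of that no-op, the sketch below performs the decode and encode explicitly with UTF-8 on both sides. The variable names and the Latin-1 counterexample are assumptions added for illustration; only the UTF-8 bytes correspond to the question's sample input.

```python
# UTF-8 bytes for the sample username; \xc3\x9c is U+00DC.
raw = b'my\xc3\x9csername'

# Decode as UTF-8, then encode as UTF-8: the bytes come back unchanged.
round_tripped = raw.decode('utf-8').encode('utf-8')
is_noop = (round_tripped == raw)

# The only observable effect is validation: bytes that are not legal
# UTF-8 (here the same name in Latin-1, an assumed counterexample) fail.
latin1_raw = b'my\xdcsername'
try:
    latin1_raw.decode('utf-8').encode('utf-8')
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(is_noop, rejected)
```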
To fix the second problem, just change:
m.group(4).encode()
to:
m.group(4)
That leaves your final code as:
result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)',
                lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4)).hexdigest(),
                line)
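For a self-contained check, here is a sketch of the same substitution written with explicit bytes literals so it also runs on Python 3, where the pattern must be bytes when the input is bytes. The br'' pattern prefix and the .encode('ascii') on the hex digest are adaptations for that purpose and are not part of the code above; the sample line is the question's input as UTF-8 bytes.

```python
import hashlib
import re

# Sample input from the question as UTF-8 bytes; \xc3\x9c is U+00DC.
line = b'Start: myUsername: my\xc3\x9csername:'

# Hash the raw bytes of group 4 directly; no encode call on the group.
result = re.sub(
    br'(Start:\s*)([^:]+)(:\s*)([^:]+)',
    lambda m: (m.group(1) + m.group(2) + m.group(3)
               + hashlib.sha512(m.group(4)).hexdigest().encode('ascii')),
    line)

print(result)
```

The prefix "Start: myUsername: " and the trailing colon are left as they were; only the second name is replaced by its 128-character SHA-512 hex digest.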
Optionally, if you want to confirm your expectation that line is in fact UTF-8 encoded bytes already, add the following above that re.sub line:
import sys  # needed for sys.exit

try:
    line.decode('utf-8')
except Exception as e:
    sys.exit("line (of type {!r}) not decodable as UTF-8: {}".format(line.__class__.__name__, e))
which will cause the program to exit immediately if the data given is not legal UTF-8 (and will also let you know what type line is, so you can confirm for sure if it's really str or unicode, since str implies you chose the wrong codec, while unicode means your inputs aren't of the expected type).