encode hash in utf-8

I want to substitute a substring with a hash. The substring contains non-ASCII characters, so I tried to encode it to UTF-8.

result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)', lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4).encode()).hexdigest(), line.encode('utf-8'))

I am not really sure why this doesn't work. I thought that line.encode('utf-8') encodes the whole string. I also tried encoding the m.group results to UTF-8, but I got the same UnicodeDecodeError.

[unicodedecodeerror: 'ascii' codec can't decode byte in position ordinal not in range(128)]

Sample input:

Start: myUsername: myÜsername:

What am I missing?

EDIT:

Traceback (most recent call last):
  File "C:/Users/Peter/Desktop/coding/filter.py", line 26, in <module>
    encodeline = line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 112: ordinal not in range(128)

Solution

Based on your symptoms, you're running on Python 2. Calling encode on a Python 2 str is almost always nonsensical.

You have two problems; one you're hitting now, and one you'll hit if you fix your current code.

Your first problem is that line is already a str holding (apparently) UTF-8 encoded bytes, not a unicode object. Calling encode on it therefore implicitly decodes it with Python's default encoding first (ASCII; this isn't locale-specific to my knowledge, and it's a rare Python 2 install that uses anything else), then re-encodes with the specified codec (or the default if none is given). Basically, line was already UTF-8 encoded and you told Python to encode it again as UTF-8. That's nonsensical, so Python tried to decode it as ASCII first, and failed before it even tried to encode as you instructed.

The solution to this problem is to just not encode line at all; it's already UTF-8 encoded, so you're already golden.
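To make the implicit decode step concrete, here is a minimal sketch that emulates it explicitly. It is written to run on Python 3, where bytes has no encode method, so the helper py2_style_encode is a hypothetical stand-in for what Python 2's str.encode does behind the scenes, not a real API.

```python
def py2_style_encode(raw, codec='utf-8', default='ascii'):
    """Emulate Python 2's str.encode: implicit decode with the default codec, then encode."""
    return raw.decode(default).encode(codec)


# UTF-8 encoded bytes, standing in for the Python 2 str read from the file
line = u'Start: myUsername: my\u00dcsername:'.encode('utf-8')

try:
    py2_style_encode(line)
    error = None
except UnicodeDecodeError as exc:
    # The failure happens in the decode step, before any encoding is attempted
    error = exc
```

Pure-ASCII input passes through this helper unchanged, which is why the problem only shows up once a line contains a character such as Ü.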

Your second problem (which you haven't encountered yet, but you will) is that you're calling encode on the group(4) result. But of course, since the input was a str, the group is a str too, and you'll encounter the same problem trying to encode a str; since the group came from raw UTF-8 encoded bytes, the non-ASCII parts of it cause a UnicodeDecodeError during the implicit decode step before the encode.

The reason:

import sys

reload(sys)
sys.setdefaultencoding('UTF8')

works is that it (dangerously) changes the implicit decode step to use UTF-8, so all your encode calls now perform the implicit decode with UTF-8 instead of ASCII. The decode and encode are mostly pointless: all they do is return the original str after confirming it's legal UTF-8 by decoding it as such, otherwise acting as an expensive no-op.
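The "expensive no-op" can be checked directly. The sketch below uses explicit decode and encode calls (so it also runs on Python 3) and the username from the sample input; the variable names are illustrative only.

```python
import hashlib

# UTF-8 bytes of the non-ASCII username from the sample input
raw = u'my\u00dcsername'.encode('utf-8')

# What encode() does once the default encoding has been switched to UTF-8:
# decode as UTF-8, then encode as UTF-8 again
round_tripped = raw.decode('utf-8').encode('utf-8')

same_bytes = round_tripped == raw
same_digest = hashlib.sha512(round_tripped).hexdigest() == hashlib.sha512(raw).hexdigest()
```

Both flags come out true, so hashing the raw bytes gives the same digest as hashing the round-tripped ones, without the extra work.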

To fix the second problem, just change:

m.group(4).encode()

to:

m.group(4)

That leaves your final code as:

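import hashlib
import re
# line is already UTF-8 encoded bytes, so neither line nor group(4) is encoded again
result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)',
                lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4)).hexdigest(),
                line)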
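To try the substitution on the sample input from the question, here is a sketch written so that it should behave the same on Python 2.7 and Python 3. It uses a bytes pattern and encodes the hex digest back to bytes, which Python 3 requires and Python 2 tolerates. The names PATTERN and hash_last_field are illustrative and not part of the original code.

```python
import hashlib
import re

# Bytes pattern, so the match groups are raw UTF-8 bytes on both Python versions
PATTERN = re.compile(br'(Start:\s*)([^:]+)(:\s*)([^:]+)')


def hash_last_field(line):
    """Replace the fourth group with the SHA-512 hex digest of its raw bytes."""
    def repl(m):
        # hexdigest() is pure ASCII, so encoding it to bytes is always safe
        digest = hashlib.sha512(m.group(4)).hexdigest().encode('ascii')
        return m.group(1) + m.group(2) + m.group(3) + digest
    return PATTERN.sub(repl, line)


sample = u'Start: myUsername: my\u00dcsername:'.encode('utf-8')
result = hash_last_field(sample)
```

The prefix "Start: myUsername: " and the trailing colon are left as they were; only the non-ASCII username is replaced by its 128-character hex digest.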

Optionally, if you want to confirm your expectation that line is in fact UTF-8 encoded bytes already, add the following above that re.sub line:

import sys
try:
    line.decode('utf-8')
except Exception as e:
    sys.exit("line (of type {!r}) not decodable as UTF-8: {}".format(line.__class__.__name__, e))

which will cause the program to exit immediately if the data given is not legal UTF-8 (and will also let you know what type line is, so you can confirm for sure if it's really str or unicode, since str implies you chose the wrong codec, while unicode means your inputs aren't of the expected type).