Tags: python, python-3.x, git, unicode, utf-8

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 173310: invalid start byte


To find all the authors, their total commit counts, and their email IDs, I cloned the torvalds/linux repository from GitHub and ran the following code on it from a Python 3 (version 3.7.3) script:

import subprocess
p = subprocess.Popen(['git shortlog -sne HEAD'], stdout=subprocess.PIPE, shell=True)
output = p.communicate()[0]
p.wait()
print(output.decode().split('\n'))  # Decode the byte string and split to get a Python list of result lines.

And got the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 173310: invalid start byte

I don't know what this error means or how to solve it.


Solution

  • The problem is that the Linux commit history will likely (surely, given the error) contain data in the fields you are retrieving that was not UTF-8 encoded.

    The simplest thing to do is tell Python to ignore errors and replace what would be broken utf-8 sequences with a replacement character in the call to decode:

    print(output.decode(encoding="utf-8", errors="replace").split('\n'))
    

    The major problem with this is that it throws away the original character and inserts a Unicode replacement character ('�') in its place.

    Depending on what you are doing, this will suffice (if you just want to look at the data on your screen, it is certainly enough).
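    To see the replacement behavior in isolation, here is a small sketch with a hypothetical byte string (the name and its bytes are made up for illustration; 0xb3 just happens to be the byte from the traceback, and is not a valid UTF-8 start byte):

```python
# A hypothetical name containing the byte 0xb3, which is not valid UTF-8
# on its own (it could be, e.g., '³' in latin1 or 'ł' in latin2).
raw = b'G\xb3ra'

# Strict decoding raises the same UnicodeDecodeError as in the question.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)

# With errors="replace", the offending byte becomes U+FFFD ('�')
# and the rest of the string survives intact.
print(raw.decode("utf-8", errors="replace"))  # G�ra
```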

    Otherwise, if one is doing this to fetch all committer names for historical or legal reasons, for example, it would be important to try to guess the original encoding of the particular commits that are not in UTF-8. That would require, for example, a try/except statement inside a loop over encodings to be tried (say, "utf-8" first, then "latin1", and so on).

    This approach has the downside that some encodings (latin1 itself, for example) will not raise an error even when they are the wrong encoding, so the name would end up mangled. If there are only a few cases where this happens - some tens, or a couple hundred - it might be worth fixing them manually rather than trying to get an algorithm to guess the correct encoding for each case (after finding the correct spelling of one broken name, all subsequent occurrences of it would be solved anyway).
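    A minimal sketch of that encoding-fallback loop might look like this (the function name and the candidate encoding list are illustrative choices, not part of the original answer; latin1 goes last because it accepts any byte sequence and therefore never raises):

```python
def decode_best_effort(raw, encodings=("utf-8", "latin2", "latin1")):
    """Try each encoding in order; fall back to replacement characters."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate did not fit, try the next one
    # Unreachable while latin1 is in the list; kept as a safety net.
    return raw.decode("utf-8", errors="replace")

# Applying it per line means one non-UTF-8 commit does not force a
# fallback decoding for the whole output:
# lines = [decode_best_effort(line) for line in output.split(b'\n')]
```

    Note the caveat from above still applies: latin1 (and latin2) will happily "succeed" on bytes that were really in some other encoding, so this only guarantees a decode, not a correct one.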