python python-2.7 sqlite subprocess fuzzing

Problems with subprocess, encoding and logging with sqlite

I have searched for quite a while for the answer to this question, and I think a lot of my trouble comes from my unfamiliarity with how the subprocess module works. This is for a fuzzing program, if anyone is interested. I should also mention that this is all being done on Linux (I think that is pertinent). I have some code like this:

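import subprocess

# Run the target and capture its return code and stderr output.
# communicate() drains both pipes and waits for the child to exit;
# calling wait() first can deadlock if the child fills a pipe buffer.
process = subprocess.Popen([app, file_name], stdout=subprocess.PIPE,
                                             stderr=subprocess.PIPE)
error_msg = process.communicate()[1]
return_code = process.returncode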

# insert results into an sqlite database log
log_cur.execute('''INSERT INTO log (return_code, error_msg) 
                   VALUES (?,?)''', [unicode(return_code), unicode(error_msg)])
log_db.commit()

99 out of 100 times it works just fine, but occasionally I get an error similar to:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xce in position 43: invalid continuation byte

EDIT: Full traceback

Traceback (most recent call last):
  File "openscadfuzzer.py", line 72, in <module>
    VALUES (?,?)''', [crashed, err_msg.decode('utf-8')])
  File "/home/username/workspace/GeneralPythonEnv/openscadfuzzer/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xdb in position 881: invalid continuation byte

Is this a problem with subprocess, the application that I am using it to run or my code? Any pointers would be appreciated (especially when it pertains to the correct usage of subprocess stdout and stderr).

Solution

My guess is that the problem is this call:

unicode(error_msg)

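What is the type of error_msg? By default the subprocess APIs return the raw bytes written by the child program (a str in Python 2). Calling unicode() on them tries to convert those bytes into characters (code points) by assuming some encoding: with no explicit argument, Python 2 uses the default encoding, which is normally ASCII, while the decode('utf-8') call in your traceback assumes utf8.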

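My guess is that the bytes aren't valid utf8 but are meant to be latin1. Every byte sequence decodes under latin1, so this never raises, but the resulting text is only correct if the child really emits latin1. You can specify which codec to use when converting bytes to characters: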

error_msg.decode('latin1')

Here's an example that hopefully demonstrates the problem and workaround:

>>> b'h\xcello'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.2/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 1: invalid continuation byte

>>> b'h\xcello'.decode('latin1')
'hÎllo'
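If you would rather not guess a codec, the second argument to decode() selects an error handler. The following is a minimal sketch of the options, reusing the bytes from the example above; the variable names are only illustrative, and which handler is appropriate depends on how you want to read the log later.

```python
raw = b'h\xcello'  # same bytes as in the example above

# 'replace' substitutes U+FFFD for each undecodable byte instead of raising.
replaced = raw.decode('utf-8', 'replace')

# 'ignore' silently drops undecodable bytes.
ignored = raw.decode('utf-8', 'ignore')

# latin1 maps every byte 0x00-0xFF to exactly one code point, so it never
# raises, and encoding the result again recovers the original bytes.
latin = raw.decode('latin1')
round_trip = latin.encode('latin1')
```

Of the three, only the latin1 route is lossless, which matters if you want to reproduce the exact output of a crashing run.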

A better solution might be to make your child process output utf8, though whether that helps also depends on what data your database is capable of storing.
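For a fuzzer, where the child may print arbitrary bytes, another option is to skip decoding and store the raw stderr output in a BLOB column. This is only a sketch under stated assumptions: child_code is a hypothetical stand-in for the fuzzed application, the database is in memory, and the log table layout is assumed rather than taken from the question.

```python
import sqlite3
import subprocess
import sys

# Hypothetical stand-in for the fuzzed application: a child Python process
# that writes bytes that are not valid UTF-8 to stderr and exits with 3.
child_code = r"import os, sys; os.write(2, b'h\xcello'); sys.exit(3)"

process = subprocess.Popen([sys.executable, "-c", child_code],
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# communicate() reads both pipes to EOF and waits for the child to exit.
_, raw_err = process.communicate()
return_code = process.returncode

# Assumed schema: error_msg is a BLOB, so no decoding is needed.
log_db = sqlite3.connect(":memory:")
log_cur = log_db.cursor()
log_cur.execute("CREATE TABLE log (return_code INTEGER, error_msg BLOB)")
log_cur.execute("INSERT INTO log (return_code, error_msg) VALUES (?, ?)",
                (return_code, sqlite3.Binary(raw_err)))
log_db.commit()

row = log_cur.execute("SELECT return_code, error_msg FROM log").fetchone()
stored = bytes(row[1])  # the exact bytes the child wrote to stderr
```

The stored bytes can be decoded later, with whatever codec or error handler turns out to be right, without losing anything at logging time.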