Search code examples
pythonutf-8python-unicode

'utf8' codec can't decode byte 0x92 when writing query results to csv


I am reading a query as text from a google sheet which is being passed as "str1" below. Here is my code:

# get query string from google sheets
# establish database connection
cursor = conn.cursor()
cursor.execute((str1))
results1 = cursor.fetchall()
cursor.close()
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
for row in results1:
    ws.append(row)

At which point I get the error as shown in the title:

  File "<stdin>", line 2, in <module>
  File "/Library/Python/2.7/site-packages/openpyxl/worksheet/worksheet.py", line 790, in append
    cell = Cell(self, row=row_idx, col_idx=col_idx, value=content)
  File "/Library/Python/2.7/site-packages/openpyxl/cell/cell.py", line 114, in __init__
    self.value = value
   File "/Library/Python/2.7/site-packages/openpyxl/cell/cell.py", line 294, in value
    self._bind_value(value)
  File "/Library/Python/2.7/site-packages/openpyxl/cell/cell.py", line 191, in _bind_value
    value = self.check_string(value)
  File "/Library/Python/2.7/site-packages/openpyxl/cell/cell.py", line 150, in check_string
    value = unicode(value, self.encoding)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 48: invalid start byte

The data is regarding author/article information (we are a publisher). It contains content id, site code, byline, author, link to facebook ad, and pull date/time

Here is an example of a row of data which held an error:

(1693279, 'CPD', 'Morgan Dietrich', "20 Intuitive People Share Their 'Something Doesn\x92t Feel Right' Story That Turned Out To Be True", 'business.facebook.com/550634765042035/posts/…;, datetime.datetime(2017, 11, 29, 20, 49, 24))

I have read a lot of questions regarding this error but cannot figure out a solution. The results of the query (results1) - which runs successfully - are a tuple like this:

( (query result ro1/col1, query result ro1/col2, query result ro1/col3), 
  (query result ro2/col1, query result ro2/col2, query result ro2/col3), ... etc... )

I have tried .encode/.decode but they don't seem to work on a tuple. I've tried filtering for the bad characters but that didn't work.

How in the world do I resolve this issue? These utf8 related errors have caused me so much annoyance in the past and as much as I read it all still seems fairly confusing.

The row that fails is

(1693279,
 'CPD',
 'Morgan Dietrich',
 "20 Intuitive People Share Their 'Something Doesn\x92t Feel Right' Story That Turned Out To Be True",
 'https://business.facebook.com/550634765042035/posts/1223000787805426',
 datetime.datetime(2017, 11, 29, 20, 49, 24))

Solution

  • You have bytestring data with non-ASCII and non-UTF8 data in them. You'll either have to configure your database to return Unicode strings instead for that 4th column, or manually decode.

    The 0x92 byte is the U+2019 RIGHT SINGLE QUOTATION MARK codepoint if you decode the data as Windows Codepage 1252 (or a number of other windows codepages in the same range, but 1252 is the most likely for otherwise English text), so you could just try that:

    for row in results1:
        row = list(row)  # convert to list for easier mutation
        row[3] = row[3].decode('cp1252')
        ws.append(row)