Tags: google-app-engine, python-2.7, zlib, bulkloader, google-cloud-datastore

Trying to upload compressed (unicode) data via the bulkloader


I ran into an issue where the data being uploaded to a db.Text property was over 1 MB, so I compressed it using zlib. The bulkloader didn't support the unicode data being uploaded by default, so I swapped out the source code to use unicodecsv rather than Python's built-in csv module. The problem I'm running into now is that Google App Engine's bulkloader is unable to handle the resulting characters (even though db.Text is a unicode property).

[ERROR   ] [Thread-12] DataSourceThread:
Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/bulkloader.py", line 1611, in run
    self.PerformWork()
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/bulkloader.py", line 1730, in PerformWork
    for item in content_gen.Batches():
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/bulkloader.py", line 542, in Batches
    self._ReadRows(key_start, key_end)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/bulkloader.py", line 452, in _ReadRows
    row = self.reader.next()
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/bulkload/csv_connector.py", line 219, in generate_import_record
    for input_dict in self.dict_generator:
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/unicodecsv/__init__.py", line 188, in next
    row = csv.DictReader.next(self)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 108, in next
    row = self.reader.next()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/unicodecsv/__init__.py", line 106, in next
    row = self.reader.next()
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/bulkload/csv_connector.py", line 55, in utf8_recoder
    for line in codecs.getreader(encoding)(stream):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 612, in next
    line = self.readline()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 527, in readline
    data = self.read(readsize, firstline=True)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 474, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c in position 29: invalid start byte
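
For reference, the failing byte isn't random: zlib output is raw binary, and at the default compression level the stream begins with the header bytes 0x78 0x9c, so the 0x9c in the traceback is the compressed data itself being run through a UTF-8 decoder. A minimal reproduction under Python 2.7 (the payload here is illustrative, not my actual data):

import zlib

text = u'some large blob of text ' * 1000
compressed = zlib.compress(text.encode('utf-8'))

# zlib's default-level stream starts with the header bytes 0x78 0x9c.
print repr(compressed[:2])   # 'x\x9c'

# Decoding the raw bytes as UTF-8 -- which the bulkloader's CSV
# connector does -- fails on that second byte:
compressed.decode('utf-8')   # UnicodeDecodeError: invalid start byte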

I know that for my local testing I could modify the Python files to use the unicodecsv module instead, but that doesn't solve the problem for GAE's Datastore in production. Is anyone aware of an existing solution to this problem?


Solution

  • Solved this the other week: you just need to base64-encode the results so the bulkloader never sees raw binary. The size increases by 30-50%, but since zlib had already compressed my data to 10% of its original size, this wasn't too bad. A sketch of the round trip follows.
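
A minimal sketch of that round trip under Python 2.7, assuming the data lives in a single db.Text property; the helper names are illustrative, not part of any API:

import base64
import zlib

def pack(text):
    # Compress the unicode text, then base64-encode so every output
    # byte is plain ASCII and survives the bulkloader's CSV/UTF-8 path.
    return base64.b64encode(zlib.compress(text.encode('utf-8')))

def unpack(blob):
    # Reverse of pack(): base64-decode, decompress, decode to unicode.
    return zlib.decompress(base64.b64decode(blob)).decode('utf-8')

original = u'some repetitive unicode payload ' * 5000
packed = pack(original)

assert unpack(packed) == original
# base64 inflates the compressed bytes by about a third (4/3), which
# is roughly where the 30-50% figure comes from once CSV escaping is
# included, but zlib's savings on repetitive text dominate.
print len(original.encode('utf-8')), len(packed)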