
Python App Engine: UnicodeEncodeError on Search API snippeted results


I'm crawling pages and indexing them with the App Engine Search API (Spanish and Catalan pages, with accented characters). Searching works, and I can render a page of results.

The problem arises when I use a query object with snippeted_fields: it always raises a UnicodeEncodeError:

  File "/home/otger/python/jobs-gae/src/apps/search/handlers/results.py", line 82, in find_documents
    return index.search(query_obj)
  File "/opt/google_appengine_1.7.6/google/appengine/api/search/search.py", line 2707, in search
    apiproxy_stub_map.MakeSyncCall('search', 'Search', request, response)
  File "/opt/google_appengine_1.7.6/google/appengine/api/apiproxy_stub_map.py", line 94, in MakeSyncCall
    return stubmap.MakeSyncCall(service, call, request, response)
  File "/opt/google_appengine_1.7.6/google/appengine/api/apiproxy_stub_map.py", line 320, in MakeSyncCall
    rpc.CheckSuccess()
  File "/opt/google_appengine_1.7.6/google/appengine/api/apiproxy_rpc.py", line 156, in _WaitImpl
    self.request, self.response)
  File "/opt/google_appengine_1.7.6/google/appengine/ext/remote_api/remote_api_stub.py", line 200, in MakeSyncCall
    self._MakeRealSyncCall(service, call, request, response)
  File "/opt/google_appengine_1.7.6/google/appengine/ext/remote_api/remote_api_stub.py", line 234, in _MakeRealSyncCall
    raise pickle.loads(response_pb.exception())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 52: ordinal not in range(128)

I found a similar question on Stack Overflow, "GAE Full Text Search development console UnicodeEncodeError", but it says the bug was fixed in 1.7.0. I get the same error with both 1.7.5 and 1.7.6.

When indexing pages I add two fields: description and description_ascii. If I generate snippets for description_ascii, it works perfectly.
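
The question does not show how description_ascii is produced. One common way to build such a field, sketched here as an assumption rather than the asker's actual code, is to decompose accented characters and drop whatever is not ASCII (to_ascii is a hypothetical helper name):

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import unicodedata


def to_ascii(text):
    """Fold accented text to plain ASCII (hypothetical helper, not part of the SDK)."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with "ignore" drops the combining marks and any other non-ASCII character.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

The folding is lossy (for example it turns "ñ" into "n"), which is a reason to keep the original description field next to the folded one.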

Is it possible to generate snippets of non-ASCII content on dev_appserver?


Solution

  • I think this is a bug; I have reported it as a new defect: https://code.google.com/p/googleappengine/issues/detail?id=9335.

    Temporary fix for the dev server: locate the google.appengine.api.search module (search.py) and patch the function _DecodeUTF8 with an inline if, so that a value that is already unicode is returned unchanged:

    def _DecodeUTF8(pb_value):
      """Decodes a UTF-8 encoded string into unicode."""
      if pb_value is not None:
        return pb_value.decode('utf-8') if not isinstance(pb_value, unicode) else pb_value
      return None
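
Some background on why the guard helps: in Python 2, calling .decode('utf-8') on a value that is already unicode first encodes it implicitly with the ascii codec, and that implicit step fails on characters such as u'\xf3'. This is presumably what happens to the snippet text here. The same guard as a standalone sketch that runs on both Python 2 and Python 3 (decode_utf8 is a hypothetical name used for illustration, not the SDK function):

```python
# -*- coding: utf-8 -*-


def decode_utf8(value):
    """Return text for `value`, decoding only when it is still a byte string."""
    if value is None:
        return None
    if isinstance(value, bytes):  # on Python 2, bytes is an alias of str
        return value.decode("utf-8")
    # Already text (unicode on Python 2, str on Python 3): leave it untouched.
    return value
```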
    

    Workaround: until the issue is fixed, implement the snippet functionality yourself. Assuming the field the snippet is built from is called snippet_base:

    query = search.Query(
        query_string=query_string,
        options=search.QueryOptions(
            ...
            returned_fields=[... 'snippet_base' ...]
        ))
    results = search.Index(name="<index-name>").search(query)
    if results:
        for res in results.results:
            # field() returns a Field object; its text is in .value.
            # some_snippeting_function is a placeholder for your own code.
            res.snippet = some_snippeting_function(res.field("snippet_base").value)
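
some_snippeting_function is left undefined above. A minimal sketch of what it could look like, assuming it receives the field's text as a unicode string plus the raw query string (make_snippet, width and marker are hypothetical names, and the <b> highlighting is just one possible choice):

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import re


def make_snippet(text, query, width=60, marker="..."):
    """Return an excerpt of `text` around the first query term, with terms in <b> tags."""
    terms = [t for t in query.split() if t]
    flags = re.IGNORECASE | re.UNICODE
    match = None
    if terms:
        pattern = "|".join(re.escape(t) for t in terms)
        match = re.search(pattern, text, flags)
    if match is None:
        # No term found (or empty query): fall back to the start of the text.
        return text[:width] + (marker if len(text) > width else "")
    # Centre the window on the match, but never cut the matched term itself.
    start = max(0, match.start() - width // 2)
    end = min(len(text), max(start + width, match.end()))
    excerpt = re.sub(pattern, lambda m: "<b>" + m.group(0) + "</b>",
                     text[start:end], flags=flags)
    prefix = marker if start > 0 else ""
    suffix = marker if end < len(text) else ""
    return prefix + excerpt + suffix
```

Because it needs a second argument, it would be called as make_snippet(res.field("snippet_base").value, query_string). The sketch treats the query as plain whitespace-separated terms, so it does not understand Search API query syntax, and it does not HTML-escape the text before adding the tags.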