Search code examples
pythonpython-2.5

how to append all econdings to a string


I've a system in python 2.5 which process files in all languages and encoding, I want to log some things, and i'm not really interested in the non standard chars, i'm willing to use only ascii chars to the log, however I'm receiving from time to time errors like.

<type 'tuple'>: (<type 'exceptions.UnicodeEncodeError'>, UnicodeEncodeError('ascii', u'Create project: 2016 May European Tour: There\u2019s Still Time to Buy Tickets!', 45, 46, 'ordinal not in range(128)'), <traceback object at 0x105b84908>)

This is some example of the code i've tried:

this works most of the time, not always

self.__log += data.decode('utf-8', 'ignore').encode("utf-8")

This failed, but it worked on a few previous don't work

self.__log += data.encode('ascii', 'ignore')

This worked for some other cases.

self.__log += data.decode('utf-8', 'replace')

the log is right now being defined as

self.__log = ""

But i've also tried with

self.__log = u""

The problem is that im not able to create a solution that works for all the cases, what should i do?


Solution

  • If you don't know what you are receiving, there's no nice and universal way.

    If you're comfortable with throwing away anything non-ascii, and badly mangling data when the data are not ascii, you can try something like this:

    def forceAscii(s):
      if isinstance(s, unicode):
        return unicode(s.encode('ascii', 'replace'))
      elif isinstance(s, basestring):
        return s.decode('ascii', 'replace').encode('ascii', 'replace')
      else:
        raise ValueError('Expected a string, got a %r' % type(s))
    

    This will give you a Unicode string which only contains ascii characters, given any Unicode or byte string. Characters that cannot be coerced to ascii will be replaced with '?' marks.

    Note that some encodings will end up with some characters badly mangled, e.g. mapped to non-printable ascii characters like \x00.