Tags: python, replace, encode

Make Python replace un-encodable characters with a string by default


I want Python to handle characters it can't encode by replacing them with the string "<could not encode>" instead of raising an error.

E.g., assuming the default encoding is ASCII, the expression

'%s is the word'%'ébác'

would yield

'<could not encode>b<could not encode>c is the word'

Is there any way to make this the default behavior across my whole project?


Solution

  • The str.encode method takes an optional errors argument that defines how encoding errors are handled:

    str.encode([encoding[, errors]])
    

    From the docs:

    Return an encoded version of the string. Default encoding is the current default string encoding. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Codec Base Classes. For a list of possible encodings, see section Standard Encodings.

    In your case, codecs.register_error is probably what you want: it registers a custom error handler under a name, and you then pass that name as the errors argument.

    [Note about bad chars]

    Note that a handler registered with register_error is called once per run of consecutive bad characters, not once per character. Unless the handler accounts for this (for example, by repeating the replacement string end - start times, using the start and end attributes of the exception it receives), a whole group of consecutive bad characters is replaced by a single copy of your string.
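
The answer's suggestion can be sketched as follows. This is a minimal illustration, not a drop-in solution. It assumes Python 3 (the question's examples look like Python 2). The handler names "placeholder" and "placeholder_per_run" and the helper safe_encode are hypothetical names chosen for this example. It also shows the per-run behavior described in the note, as observed on CPython.

```python
import codecs

PLACEHOLDER = "<could not encode>"  # replacement string from the question


def placeholder_errors(exc):
    """Replace every un-encodable character with PLACEHOLDER."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc  # this sketch only handles encoding errors
    # exc.start/exc.end delimit the run of bad characters, so repeat
    # the placeholder once per character in that run.
    bad_chars = exc.end - exc.start
    return (PLACEHOLDER * bad_chars, exc.end)


def placeholder_per_run(exc):
    """Replace each run of bad characters with a single PLACEHOLDER."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    return (PLACEHOLDER, exc.end)


# The names are arbitrary; they are what callers pass as errors=...
codecs.register_error("placeholder", placeholder_errors)
codecs.register_error("placeholder_per_run", placeholder_per_run)


def safe_encode(text, encoding="ascii", errors="placeholder"):
    """Hypothetical helper: encode text using a registered handler."""
    return text.encode(encoding, errors=errors)


print(safe_encode("ébác is the word"))
print(safe_encode("éá"))                                # two placeholders
print(safe_encode("éá", errors="placeholder_per_run"))  # one placeholder
```

Registering a handler does not change the default on its own: the errors argument still defaults to 'strict', so the handler name has to be passed wherever encoding happens (for example via a single helper like safe_encode, or the errors parameter of open()). The replacement string must itself be encodable in the target encoding.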