
In Python 3, how do you remove all non-UTF8 characters from a string?


I'm using Python 3.7. How do I remove all non-UTF-8 characters from a string? I tried using `lambda x: x.decode('utf-8','ignore').encode("utf-8")` in the code below:

coop_types = map(
    lambda x: x.decode('utf-8','ignore').encode("utf-8"),
    filter(None, set(d['type'] for d in input_file))
)

but this results in the following error:

Traceback (most recent call last):
  File "scripts/parse_coop_csv.py", line 30, in <module>
    for coop_type in coop_types:
  File "scripts/parse_coop_csv.py", line 25, in <lambda>
    lambda x: x.decode('utf-8','ignore').encode("utf-8"),
AttributeError: 'str' object has no attribute 'decode'

If you have a generic way to remove all non-UTF8 chars from a string, that's all I'm looking for.


Solution

  • You're starting with a str. You can't decode a str: it is already decoded text, so the only thing you can do is encode it to binary data again. UTF-8 can encode any valid Unicode text (which is what str stores) except lone surrogate code points, so this shouldn't come up much. If you are encountering surrogate characters in your input, though, you can just reverse the directions, changing:

    x.decode('utf-8','ignore').encode("utf-8")
    

    to:

    x.encode('utf-8','ignore').decode("utf-8")
    

    which encodes everything UTF-8 can represent, discards whatever it cannot encode, and then decodes the now-clean UTF-8 bytes back into a str.
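
As a rough illustration of the reversed call, here is a minimal sketch applied to data shaped like the question's. The helper name `strip_unencodable` and the sample `rows` list are assumptions made up for this example (they stand in for the lambda and `input_file`), and the last lines show the separate case where the input really is `bytes`, which is the only case where `.decode(..., 'ignore')` applies.

```python
def strip_unencodable(text: str) -> str:
    """Drop code points that UTF-8 cannot encode (lone surrogates)."""
    # encode: str -> bytes, silently skipping anything unencodable;
    # decode: bytes -> str, which cannot fail on the bytes we just produced.
    return text.encode("utf-8", "ignore").decode("utf-8")


# Hypothetical rows standing in for the question's input_file.
rows = [
    {"type": "credit union"},
    {"type": "caf\udce9"},   # contains a lone surrogate
    {"type": ""},            # empty values are filtered out, as in the question
]

coop_types = sorted(
    strip_unencodable(t) for t in set(d["type"] for d in rows) if t
)

# If the data is still raw bytes, decode is the right direction instead:
raw = b"caf\xe9"                        # Latin-1 bytes, not valid UTF-8
text_from_bytes = raw.decode("utf-8", "ignore")
```

Ordinary non-ASCII text such as "café" passes through the helper unchanged, because it is perfectly encodable as UTF-8; only lone surrogates are dropped.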