Python convert binary file into string while ignoring non-ascii characters


I have a binary file and I want to extract all ASCII characters while ignoring the non-ASCII ones. Currently I have:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text))
   file.close()

However, I get an error when writing to the file: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128). How can I get Python to ignore non-ASCII characters?


Solution

  • Use the built-in ASCII codec and tell it to ignore any errors, like:

    with open(filename, 'rb') as fobj:
        # Decode the raw bytes into a unicode string
        text = fobj.read().decode('utf-16-le')
    # Encode to ASCII, dropping every character the codec can't represent
    with open("text.txt", "w") as out:
        out.write(text.encode('ascii', 'ignore'))
    

    You can test & play around with this in the Python interpreter:

    >>> s = u'hello \u00a0 there'
    >>> s
    u'hello \xa0 there'
    

    Just trying to convert it with str() throws an exception, because str() implicitly encodes to ASCII:

    >>> str(s)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)
    

    ...as does just trying to encode that unicode string to ASCII:

    >>> s.encode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)
    

    ...but telling the codec to ignore the characters it can't handle works okay:

    >>> s.encode('ascii', 'ignore')
    'hello  there'
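
The session above is from Python 2 (note the u'...' literals in the output). As a rough sketch of the same idea on Python 3, where str.encode() returns bytes, the output file can be opened in binary mode and the encoded bytes written directly. The helper name extract_ascii and its parameters are made up for this illustration, and the input is assumed to be valid UTF-16-LE, as in the question:

```python
import os
import tempfile


def extract_ascii(src_path, dst_path, encoding="utf-16-le", errors="ignore"):
    """Decode src_path, encode the text as ASCII, and write it to dst_path.

    Hypothetical helper for illustration only. `errors` is passed to
    str.encode(): "ignore" drops non-ASCII characters, "replace" writes "?".
    """
    with open(src_path, "rb") as src:
        # bytes -> str; this still raises if the input is not valid `encoding`
        text = src.read().decode(encoding)
    # str -> ASCII bytes; `errors` decides what happens to non-ASCII characters
    ascii_bytes = text.encode("ascii", errors)
    with open(dst_path, "wb") as dst:
        # Binary mode: the bytes are written exactly as encoded
        dst.write(ascii_bytes)
    return ascii_bytes


if __name__ == "__main__":
    # Build a small UTF-16-LE sample file in a temporary directory
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "input.bin")
    dst = os.path.join(workdir, "text.txt")
    with open(src, "wb") as f:
        f.write("hello \u00a0 there".encode("utf-16-le"))

    print(extract_ascii(src, dst))                    # b'hello  there'
    print(extract_ascii(src, dst, errors="replace"))  # b'hello ? there'
```

With 'ignore' the non-breaking space disappears entirely, which is why two spaces remain next to each other; 'replace' keeps a visible '?' wherever a character was dropped, which can be easier to spot when checking the output.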