Tags: python, python-2.7, unicode, mojibake

Using python to clean up string encoding problems in text files


I have a lot of XML documents, and filenames for external files, that have various forms of text corruption or Mojibake causing data quality problems during import. I've read a number of different posts on StackOverflow about correcting strings, but they don't really outline how to clean up text in a systematic way, and Python's decode/encode don't seem to be helping. How can I recover an XML file and filenames, using Python 2.7, that contain characters in the range of Latin-1 (ISO-8859-1) but generally have mixed encodings?


Solution

  • You have to make assumptions

    If you can't make assumptions about the kinds of letters you'll be encountering, you're probably in trouble. So it is good that for this document we can reasonably assume the Norwegian alphabet A-Å. There is no magic tool that will auto-correct every document you encounter.

    So within this domain we know that a file might contain å either as the UTF-8 two-byte sequence 0xc3 0xa5 or as the single byte 0xe5, which is how Latin-1 and Windows-1252 represent it and which also matches its Unicode code point U+00E5. Generally, this character lookup is very handy and might make for a good bookmark if you find yourself researching a character.

    Example

    • The Norwegian å
    • The corrupted version Ã¥

    You can find a long list of these kinds of issues in this handy debugging chart.
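    To make the corruption concrete, here is a minimal sketch (Python 2.7) of how å becomes Ã¥: the UTF-8 bytes for å get decoded with the wrong codec somewhere along the pipeline.

    # -*- coding: utf-8 -*-
    # UTF-8 bytes for å, mistakenly decoded as Latin-1, become the two
    # characters Ã¥ -- exactly the corrupted form shown above.
    correct = u'\xe5'                        # u'å'
    utf8_bytes = correct.encode('UTF-8')     # '\xc3\xa5'
    mojibake = utf8_bytes.decode('LATIN-1')  # u'\xc3\xa5', i.e. u'Ã¥'
    print repr(utf8_bytes), repr(mojibake)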

    Basic Python encode, decode

    This is the simplest way to hack the string back into shape if you know exactly what went wrong.

    # -*- coding: utf-8 -*-
    our_broken_string = 'Ã¥'  # in a UTF-8 source file these are the bytes '\xc3\x83\xc2\xa5'
    broken_unicode = our_broken_string.decode('UTF-8')
    print repr(broken_unicode)  # u'\xc3\xa5' yikes -> two different unicode characters
    down_converted_string = broken_unicode.encode('LATIN-1')
    print repr(down_converted_string)  # '\xc3\xa5' those are the right bytes
    correct_unicode = down_converted_string.decode('UTF-8')
    print repr(correct_unicode)  # u'\xe5' the correct unicode value for å
    

    Documents

    When working with documents there are some relatively good assumptions that can be made: words, whitespace and lines. Even if the document is XML, you can still think about it as words and not really worry too much about the tags, or whether the words are really words; you just need the smallest unit you can find. We can also assume that if the file has text-encoding problems, it probably has line-ending issues as well, depending on how many different OSes mangled that file. I would break on line endings, rstrip, and recombine the array by printing to a StringIO file handle.

    When preserving whitespace it might be tempting to run an XML document through a pretty-print function, but you shouldn't: we just want to correct the encoding of small text units without changing anything else. A good starting point is to see if you can get through the document line by line and word by word, NOT in arbitrary byte blocks, and to ignore the fact that you're dealing with XML.

    Here I leverage the fact that you'll get a UnicodeDecodeError if the text is out of range for UTF-8, and then attempt LATIN-1. This worked for the document in question.

    from __future__ import print_function

    import unicodedata
    from StringIO import StringIO

    encoding_priority = ['UTF-8', 'LATIN-1']
    output_encoding = 'UTF-8'  # target encoding for the cleaned output


    def clean_chunk(file_chunk):
        error_count = 0
        corrected_count = 0
        new_chunk = u''  # stays empty if every decode attempt fails
        encoding = ''
        for encoding in encoding_priority:
            try:
                # Positional args: str.decode() takes no keyword arguments in Python 2.
                new_chunk = file_chunk.decode(encoding, 'strict')
                corrected_count += 1
                break
            except UnicodeDecodeError as error:
                print('Input encoding %s failed -> %s' % (encoding, error))
                error_count += 1
        if encoding != '' and error_count > 0 and corrected_count > 0:
            print('Decoded. %s(%s) from hex(%s)' % (encoding, new_chunk, file_chunk.encode('HEX')))

        normalized = unicodedata.normalize('NFKC', new_chunk)

        return normalized, error_count, corrected_count


    def clean_document(document):
        cleaned_text = StringIO()
        error_count = 0
        corrected_count = 0

        for line in document:
            normalized_words = []
            words = line.rstrip().split(' ')
            for word in words:
                # Accumulate the per-word counts instead of overwriting the totals.
                normalized_word, chunk_errors, chunk_corrections = clean_chunk(word)
                error_count += chunk_errors
                corrected_count += chunk_corrections
                normalized_words.append(normalized_word)
            normalized_line = ' '.join(normalized_words)
            encoded_line = normalized_line.encode(output_encoding)
            print(encoded_line, file=cleaned_text)

        cleaned_document = cleaned_text.getvalue()
        cleaned_text.close()

        return cleaned_document, error_count, corrected_count
    
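    A hypothetical way to drive this (the file paths and the choice of UTF-8 output are my assumptions): open the document in binary mode so the byte-level decoding stays under your control, then write the cleaned bytes back out.

    # Hypothetical usage of clean_document; 'input.xml' and 'cleaned.xml' are placeholders.
    with open('input.xml', 'rb') as document:
        cleaned, errors, corrections = clean_document(document)
    print('errors: %d, corrections: %d' % (errors, corrections))
    with open('cleaned.xml', 'wb') as output_file:
        output_file.write(cleaned)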

    FTFY for dealing with Mojibake

    If your problem is real Mojibake, like perhaps a bad filename, you can use FTFY to try to heuristically correct your problem. Again, I'd take a word-by-word approach for best results (there's a sketch of that below, after the renaming script).

    import os
    import sys
    import ftfy
    import unicodedata


    if __name__ == '__main__':
        path = sys.argv[1]
        file_system_encoding = sys.getfilesystemencoding()
        unicode_path = path.decode(file_system_encoding)

        for root, dirs, files in os.walk(unicode_path):
            for f in files:
                # Normalize both names the same way (NFC) so the comparison only
                # flags real mojibake, not harmless NFC/NFD differences.
                comparable_original_filename = unicodedata.normalize('NFC', f)
                comparable_new_filename = ftfy.fix_text(f, normalization='NFC')

                if comparable_original_filename != comparable_new_filename:
                    original_path = os.path.join(root, f)
                    new_path = os.path.join(root, comparable_new_filename)
                    print "Renaming: " + original_path + " to: " + new_path
                    os.rename(original_path, new_path)
    

    This went through the directory correcting much uglier errors where the å had been mangled into A\xcc\x83\xc2\xa5. What is this? The capital letter A plus COMBINING TILDE (U+0303, the UTF-8 bytes 0xcc 0x83) is one of several ways to represent Ã under Unicode equivalence, and the trailing 0xc2 0xa5 is the UTF-8 encoding of ¥. This is really a job for FTFY, because it will actually perform a heuristic and puzzle out these kinds of issues.
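    For document text rather than filenames, the word-by-word approach mentioned above might look like this minimal sketch (fix_line is a made-up helper, and the exact output depends on the installed ftfy version):

    # -*- coding: utf-8 -*-
    import ftfy

    def fix_line(line):
        # Fix each word separately so one badly mangled word cannot
        # confuse the heuristic for the rest of the line.
        return u' '.join(ftfy.fix_text(word, normalization='NFC')
                         for word in line.split(u' '))

    print fix_line(u'hÃ¥ndbok')  # expected to come back as u'håndbok'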

    Unicode normalization for comparison and filesystems

    Another way would be to use the normalization of unicode to get the correct bytes.

    # -*- coding: utf-8 -*-
    import unicodedata

    a_combining_tilde = 'A\xcc\x83'
    # Assume: expecting UTF-8
    unicode_version = a_combining_tilde.decode('UTF-8')  # u'A\u0303' -- this cannot simply be encoded to LATIN-1 to get Ã
    normalized = unicodedata.normalize('NFC', unicode_version)  # u'\xc3'
    broken_but_better = normalized.encode('UTF-8')  # '\xc3\x83' -- the correct UTF-8 bytes for Ã
    

    So, in summary: if you take the UTF-8 bytes A\xcc\x83\xc2\xa5, decode them as UTF-8, normalize the result with NFC, encode it down to a LATIN-1 byte string, and then decode those bytes as UTF-8 once more, you get the correct å back.
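    A minimal sketch of that round trip in Python 2.7, starting from the mangled bytes seen on disk:

    # -*- coding: utf-8 -*-
    import unicodedata

    mangled = 'A\xcc\x83\xc2\xa5'                        # bytes found on disk
    as_unicode = mangled.decode('UTF-8')                 # u'A\u0303\xa5'
    composed = unicodedata.normalize('NFC', as_unicode)  # u'\xc3\xa5', i.e. the mojibake Ã¥
    mojibake_bytes = composed.encode('LATIN-1')          # '\xc3\xa5'
    recovered = mojibake_bytes.decode('UTF-8')           # u'\xe5', the original å
    print repr(recovered)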

    You need to be mindful of how the OS encodes filenames. You can retrieve that information with:

    import sys

    file_system_encoding = sys.getfilesystemencoding()
    

    So let's say file_system_encoding is UTF-8. Great, right? Then you compare two seemingly identical unicode strings and they're not equal! FTFY by default normalizes to NFC; HFS normalizes to an older version of NFD. So simply knowing the encoding is the same isn't good enough; you'll have to normalize in the same way for comparisons to be valid.

    • Windows NTFS stores unicode without normalization.
    • Linux stores unicode without normalization.
    • Mac HFS+ stores UTF-8 with a proprietary variant of NFD normalization.

    Node.js has a good guide about dealing with different filesystems. In summary, normalize for comparison, don't arbitrarily renormalize filenames.
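    A minimal sketch of why that matters (the filenames here are invented): two names that render identically compare unequal until both are normalized the same way.

    import unicodedata

    nfc_name = u'r\xe5data.txt'      # precomposed a-with-ring, U+00E5 (NFC form)
    nfd_name = u'ra\u030adata.txt'   # 'a' + U+030A COMBINING RING ABOVE (NFD form)

    print nfc_name == nfd_name       # False -- different code point sequences
    print (unicodedata.normalize('NFC', nfc_name) ==
           unicodedata.normalize('NFC', nfd_name))  # True once both are normalized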

    Final notes

    Lies, damned lies, and XML declarations

    In XML documents you'll see something like this, which is supposed to inform the XML parser about the text encoding.

    <?xml version="1.0" encoding="ISO-8859-1"?>
    

    If you see this, it should be treated as a lie until proven to be true. You need to validate and handle the encoding issues before handing this document to an XML parser and you need to correct the declaration.
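    A minimal sketch of that validation step, assuming you already have the document's raw bytes (both helper names are made up). Note that a permissive single-byte encoding such as Latin-1 will happily decode any byte sequence, so a clean strict decode is necessary but not sufficient evidence that the declaration is honest.

    import re

    def declared_xml_encoding(raw_bytes):
        # Pull the encoding attribute out of the XML declaration, if any.
        match = re.match(r'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', raw_bytes)
        return match.group(1) if match else None

    def declaration_survives_strict_decode(raw_bytes):
        # Only trust the declaration if the whole document decodes strictly
        # with the encoding it claims to be in.
        encoding = declared_xml_encoding(raw_bytes)
        if encoding is None:
            return False
        try:
            raw_bytes.decode(encoding)
            return True
        except (UnicodeDecodeError, LookupError):
            return False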

    Lies, damned lies, and BOM markers

    Byte-order marks sound like a great idea, but like their XML-declaration cousins they are totally unreliable indicators of a file's encoding situation. Within UTF-8, BOMs are NOT recommended and carry no meaning with respect to byte order; their only value is to indicate that something is encoded in UTF-8. However, given the realities of text encoding, the default is, and should be, to expect UTF-8.
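    If you do encounter a BOM, treat it as a hint to verify rather than a fact to trust. A small sketch (strip_utf8_bom is a made-up helper):

    import codecs

    def strip_utf8_bom(raw_bytes):
        # Drop a leading UTF-8 BOM if present; either way, confirm the encoding
        # by actually decoding the bytes rather than trusting the marker.
        if raw_bytes.startswith(codecs.BOM_UTF8):
            return raw_bytes[len(codecs.BOM_UTF8):]
        return raw_bytes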