I have a lot of XML documents and filenames for external files with various forms of text corruption or Mojibake, causing data quality problems during import. I've read a number of posts on StackOverflow about correcting strings, but they fail to outline how to clean up text in a systematic way, and Python's decode
and encode
don't seem to be helping. Using Python 2.7, how can I recover an XML file and filenames that contain characters in the Latin-1 (ISO-8859-1) range but have mixed encodings in general?
If you can't make assumptions about the kinds of letters you'll be encountering, you're probably in trouble. So it is good that in this document we can reasonably assume the Norwegian alphabet A-Å
. There is no magic tool that will auto-correct every document you encounter.
So within this domain we know that a file might contain å
. In UTF-8 that character is the 2-byte sequence 0xc3 0xa5
, while Latin-1 and Windows-1252 represent it as the single byte 0xe5
(which is also its Unicode code point, U+00E5). Generally, this character lookup is very nice and might make for a good bookmark if you find yourself researching a character. When the UTF-8 bytes for å
are misread as Latin-1, you get the classic Mojibake pair: å
becomes Ã¥
. You can find a long list of these kinds of issues in this handy debugging chart.
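You can verify those byte representations yourself. A minimal check (byte literals are used so the snippet behaves the same in Python 2 and 3):

```python
# å is U+00E5: two bytes in UTF-8, a single byte in Latin-1
assert u'\xe5'.encode('UTF-8') == b'\xc3\xa5'
assert u'\xe5'.encode('LATIN-1') == b'\xe5'
# misreading the UTF-8 bytes as Latin-1 yields the Mojibake pair u'Ã¥'
assert b'\xc3\xa5'.decode('LATIN-1') == u'\xc3\xa5'
```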
This is the simplest way to hack the string back into shape if you know exactly what went wrong.
our_broken_string = 'Ã¥'
broken_unicode = our_broken_string.decode('UTF-8')
print broken_unicode # u'\xc3\xa5' yikes -> two different unicode characters
down_converted_string = broken_unicode.encode('LATIN-1')
print down_converted_string # '\xc3\xa5' those are the right bytes
correct_unicode = down_converted_string.decode('UTF-8')
print correct_unicode # u'\xe5' correct unicode value
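Running that chain in reverse shows how the damage happened in the first place: the correct UTF-8 bytes were decoded as Latin-1 somewhere upstream.

```python
correct_string = u'\xe5'                     # å
utf8_bytes = correct_string.encode('UTF-8')  # b'\xc3\xa5'
mojibake = utf8_bytes.decode('LATIN-1')      # u'\xc3\xa5' -> u'Ã¥', the corruption we started with
```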
When working with documents, there are some relatively good assumptions that can be made: words, whitespace, and lines. Even if the document is XML, you can still think about it as words and not worry too much about the tags, or whether the words are really words; you just need the smallest unit you can find. We can also assume that if the file has text-encoding problems, it probably has line-ending issues as well, depending on how many different OSes mangled it. I would break on line endings, rstrip
, and recombine the array by printing to a StringIO
file handle.
When preserving whitespace it might be tempting to run an XML document through a pretty-print function, but you shouldn't; we just want to correct the encoding of small text units without changing anything else. A good starting point is to see if you can get through the document line-by-line and word-by-word, NOT in arbitrary byte blocks, and ignore the fact that you're dealing with XML.
Here I leverage the fact that you'll get a UnicodeDecodeError if the text is out of range for UTF-8, and then fall back to Latin-1. This worked for this document.
from __future__ import print_function

import unicodedata
from StringIO import StringIO

encoding_priority = ['UTF-8', 'LATIN-1']
output_encoding = 'UTF-8'

def clean_chunk(file_chunk):
    error_count = 0
    corrected_count = 0
    new_chunk = u''
    used_encoding = ''
    for encoding in encoding_priority:
        try:
            new_chunk = file_chunk.decode(encoding, errors='strict')
            used_encoding = encoding
            corrected_count += 1
            break
        except UnicodeDecodeError as error:
            print('Input encoding %s failed -> %s' % (encoding, error))
            error_count += 1
    if used_encoding != '' and error_count > 0 and corrected_count > 0:
        print('Decoded. %s(%s) from hex(%s)' % (used_encoding, new_chunk, file_chunk.encode('HEX')))
    normalized = unicodedata.normalize('NFKC', new_chunk)
    return normalized, error_count, corrected_count

def clean_document(document):
    cleaned_text = StringIO()
    error_count = 0
    corrected_count = 0
    for line in document:
        normalized_words = []
        words = line.rstrip().split(' ')
        for word in words:
            normalized_word, chunk_errors, chunk_corrected = clean_chunk(word)
            error_count += chunk_errors
            corrected_count += chunk_corrected
            normalized_words.append(normalized_word)
        normalized_line = ' '.join(normalized_words)
        encoded_line = normalized_line.encode(output_encoding)
        print(encoded_line, file=cleaned_text)
    cleaned_document = cleaned_text.getvalue()
    cleaned_text.close()
    return cleaned_document, error_count, corrected_count
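To see the encoding-priority fallback in isolation, here is a minimal, self-contained sketch using a word that is valid Latin-1 but invalid UTF-8 (the word blå is my own example, not from the document above):

```python
word = b'bl\xe5'  # 'blå' encoded as Latin-1; 0xe5 here is an invalid UTF-8 sequence
decoded = None
for encoding in ('UTF-8', 'LATIN-1'):
    try:
        decoded = word.decode(encoding)
        break  # first encoding that decodes strictly wins
    except UnicodeDecodeError:
        pass
# decoded is now u'bl\xe5' and encoding is 'LATIN-1'
```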
If your problem is real Mojibake, like a bad filename, you can use FTFY to try to heuristically correct it. Again, I'd take a word-by-word approach for best results.
import os
import sys
import ftfy
import unicodedata

if __name__ == '__main__':
    path = sys.argv[1]
    file_system_encoding = sys.getfilesystemencoding()
    unicode_path = path.decode(file_system_encoding)
    for root, dirs, files in os.walk(unicode_path):
        for f in files:
            comparable_original_filename = unicodedata.normalize('NFC', f)
            comparable_new_filename = ftfy.fix_text(f, normalization='NFC')
            if comparable_original_filename != comparable_new_filename:
                original_path = os.path.join(root, f)
                new_path = os.path.join(root, comparable_new_filename)
                print "Renaming: " + original_path + " to: " + new_path
                os.rename(original_path, new_path)
This went through the directory correcting much uglier errors, where å
had been mangled into A\xcc\x83\xc2\xa5
. What is this? It is the capital letter A
plus COMBINING TILDE
(0xcc 0x83 is the UTF-8 encoding of U+0303), which is one of several ways to represent Ã
under Unicode equivalence, followed by 0xc2 0xa5, the UTF-8 encoding of ¥. This is really a job for FTFY, because it will actually perform a heuristic and puzzle out these kinds of issues.
Another way would be to use Unicode normalization to get the correct bytes.
import unicodedata

a_combining_tilde = 'A\xcc\x83'
# Assume: Expecting UTF-8
unicode_version = a_combining_tilde.decode('UTF-8')         # u'A\u0303' -- this cannot be encoded to LATIN-1 to get Ã
normalized = unicodedata.normalize('NFC', unicode_version)  # u'\xc3'
broken_but_better = normalized.encode('UTF-8')              # '\xc3\x83' correct UTF-8 bytes for Ã
So in summary: if you treat A\xcc\x83\xc2\xa5
as a UTF-8 encoded string, normalize it, down-convert it to a Latin-1 string, and then decode that back from UTF-8, you get the correct Unicode text back.
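That whole round trip can be checked in a few lines (byte literals are used so the example behaves the same in Python 2 and 3):

```python
import unicodedata

mangled = b'A\xcc\x83\xc2\xa5'                          # decomposed Ã plus the UTF-8 bytes for ¥
as_unicode = mangled.decode('UTF-8')                    # u'A\u0303\xa5'
composed = unicodedata.normalize('NFC', as_unicode)     # u'\xc3\xa5' -> u'Ã¥'
recovered = composed.encode('LATIN-1').decode('UTF-8')  # u'\xe5' -> å
```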
You need to be mindful of how the OS encodes filenames. You can retrieve that information with:
file_system_encoding = sys.getfilesystemencoding()
So let's say file_system_encoding
is UTF-8
; great, right? Then you compare two seemingly identical unicode strings and they're not equal! FTFY, by default, normalizes to NFC
, while HFS normalizes to a variant of an older version of NFD
. So simply knowing that the encodings are the same isn't good enough; you'll have to normalize the same way for comparisons to be valid.
Node.js has a good guide about dealing with different filesystems. In summary, normalize for comparison, don't arbitrarily renormalize filenames.
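A quick demonstration of why normalization matters for comparisons (the NFD spelling below is how an HFS-style filesystem might store å):

```python
import unicodedata

composed = u'\xe5'       # å as a single code point (NFC form)
decomposed = u'a\u030a'  # a + COMBINING RING ABOVE (NFD form)

# Visually identical, but not equal code-point-for-code-point
assert composed != decomposed
# Normalizing both sides the same way makes the comparison valid
assert unicodedata.normalize('NFC', decomposed) == composed
```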
In XML documents you'll get something like this that is supposed to inform the XML parser about the text encoding.
<?xml version="1.0" encoding="ISO-8859-1"?>
If you see this declaration, it should be treated as a lie until proven true. You need to validate and handle the encoding issues before handing the document to an XML parser, and then correct the declaration to match reality.
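Once you've verified the real encoding yourself, you can rewrite the declaration to match. This is a hypothetical sketch (the function name and regex are my own, not from any library):

```python
import re

def fix_declaration(xml_bytes, real_encoding='UTF-8'):
    # Decode with the encoding we verified ourselves, ignoring the
    # declaration's claim, then rewrite the declaration so it no longer lies.
    text = xml_bytes.decode(real_encoding)
    return re.sub(r'encoding="[^"]+"', 'encoding="%s"' % real_encoding, text, count=1)
```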
Byte-order marks sound like a great idea but, like their XML-declaration cousin, are totally unreliable indicators of a file's encoding. Within UTF-8, BOMs are NOT recommended and carry no meaning with respect to byte order; their only value is to hint that something is encoded in UTF-8. Even so, given the realities of text encoding, the default is, and should be, to expect UTF-8.
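If a UTF-8 BOM does show up, the safest move is to strip it before further processing; the standard library's codecs.BOM_UTF8 holds the three-byte marker (the input here is my own hypothetical example):

```python
import codecs

data = codecs.BOM_UTF8 + b'<?xml version="1.0"?>'  # hypothetical input with a leading BOM
if data.startswith(codecs.BOM_UTF8):
    data = data[len(codecs.BOM_UTF8):]
# data now starts at the actual content
```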