python · python-2.7 · unicode · mojibake

Unicode Normalization


Is there a possible normalization path which brings both strings below to same value?

  • u'Aho\xe2\u20ac\u201cCorasick_string_matching_algorithm'
  • u'Aho\u2013Corasick string matching algorithm'

Solution

  • It looks like you have a Mojibake there: UTF-8 bytes that were decoded as if they were Windows-1252 data instead. Your three 'characters', encoded back to Windows-1252, produce exactly the three UTF-8 bytes of the U+2013 EN DASH character in your target string:

    >>> u'\u2013'.encode('utf8')
    '\xe2\x80\x93'
    >>> u'\u2013'.encode('utf8').decode('windows-1252')
    u'\xe2\u20ac\u201c'
    

    You can use the ftfy module to repair that data, so you get an en dash for those bytes:

    >>> import ftfy
    >>> sample = u'Aho\xe2\u20ac\u201cCorasick_string_matching_algorithm'
    >>> ftfy.fix_text(sample)
    u'Aho\u2013Corasick_string_matching_algorithm'
    

    then simply replace underscores with spaces:

    >>> ftfy.fix_text(sample).replace('_', ' ')
    u'Aho\u2013Corasick string matching algorithm'
    

    You can also simply encode to Windows-1252 and decode again as UTF-8, but that doesn't always work: certain bytes are not legal Windows-1252 and cannot be decoded as such, yet some of the systems producing these Mojibakes decode them anyway. ftfy includes specialised repair codecs to reverse that process, and it detects which specific Mojibake errors were made, automating the repair across multiple possible codec mix-ups.
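
    A minimal sketch of that manual round trip, using only the standard library (the 0x9D example of a byte with no Windows-1252 mapping is an illustration, not taken from the question):

    ```python
    # -*- coding: utf-8 -*-
    # Manual Mojibake repair: re-encode the mis-decoded text to Windows-1252
    # to recover the original UTF-8 bytes, then decode those bytes as UTF-8.
    sample = u'Aho\xe2\u20ac\u201cCorasick_string_matching_algorithm'
    repaired = sample.encode('windows-1252').decode('utf-8')
    assert repaired == u'Aho\u2013Corasick_string_matching_algorithm'

    # The caveat: a handful of byte values (e.g. 0x9D) are undefined in
    # Windows-1252, so the encode step fails for text a sloppy decoder
    # produced from those bytes; ftfy's repair codecs still handle them.
    try:
        u'\x9d'.encode('windows-1252')
    except UnicodeEncodeError:
        print('0x9D has no Windows-1252 mapping')
    ```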