Tags: python, python-3.x, unicode, python-unicode, unicode-normalization

Some annoying characters are not normalised by unicodedata


I have a Python string, shown below, taken from the SEC filing of a US public company. I am trying to remove some annoying characters from it with the unicodedata.normalize function, but it does not remove all of them. What could be the reason for this behavior?

from unicodedata import normalize
s = '[email protected]\nFacsimile\nNo.:\xa0 312-233-2266\n\xa0\nJPMorgan Chase Bank,\nN.A., as Administrative Agent\n10 South Dearborn, Floor 7th\nIL1-0010\nChicago, IL 60603-2003\nAttention:\xa0 Hiral Patel\nFacsimile No.:\xa0 312-385-7096\n\xa0\nLadies and Gentlemen:\n\xa0\nReference is made to the\nCredit Agreement, dated as of May\xa07, 2010 (as the same may be amended,\nrestated, supplemented or otherwise modified from time to time, the \x93Credit Agreement\x94), by and among\nHawaiian Electric Industries,\xa0Inc., a Hawaii corporation (the \x93Borrower\x94), the Lenders from time to\ntime party thereto and JPMorgan Chase Bank, N.A., as issuing bank and\nadministrative agent (the \x93Administrative Agent\x94).'

normalize('NFKC', s)
'[email protected]\nFacsimile\nNo.:  312-233-2266\n \nJPMorgan Chase Bank,\nN.A., as Administrative Agent\n10 South Dearborn, Floor 7th\nIL1-0010\nChicago, IL 60603-2003\nAttention:  Hiral Patel\nFacsimile No.:  312-385-7096\n \nLadies and Gentlemen:\n \nReference is made to the\nCredit Agreement, dated as of May 7, 2010 (as the same may be amended,\nrestated, supplemented or otherwise modified from time to time, the \x93Credit Agreement\x94), by and among\nHawaiian Electric Industries, Inc., a Hawaii corporation (the \x93Borrower\x94), the Lenders from time to\ntime party thereto and JPMorgan Chase Bank, N.A., as issuing bank and\nadministrative agent (the \x93Administrative Agent\x94).'

As the output shows, the character \xa0 is handled properly, but characters such as \x92, \x93 and \x94 are not normalized and appear unchanged in the result string.


Solution

  • Your data was decoded as ISO-8859-1 (aka latin1), where the bytes 0x92, 0x93 and 0x94 map to the control characters U+0092, U+0093 and U+0094. In Windows-1252 (aka cp1252) the same bytes are so-called smart quotes:

    >>> '\x92\x93\x94'.encode('latin1').decode('cp1252')
    '’“”'
    

    They also don't change when normalized, but at least they display correctly if decoded properly:

    >>> import unicodedata as ud
    >>> ud.normalize('NFKC','\x92\x93\x94'.encode('latin1').decode('cp1252'))
    '’“”'
    >>> print(s.encode('latin1').decode('cp1252'))
    [email protected]
    Facsimile
    No.:  312-233-2266
     
    JPMorgan Chase Bank,
    N.A., as Administrative Agent
    10 South Dearborn, Floor 7th
    IL1-0010
    Chicago, IL 60603-2003
    Attention:  Hiral Patel
    Facsimile No.:  312-385-7096
     
    Ladies and Gentlemen:
     
    Reference is made to the
    Credit Agreement, dated as of May 7, 2010 (as the same may be amended,
    restated, supplemented or otherwise modified from time to time, the “Credit Agreement”), by and among
    Hawaiian Electric Industries, Inc., a Hawaii corporation (the “Borrower”), the Lenders from time to
    time party thereto and JPMorgan Chase Bank, N.A., as issuing bank and
    administrative agent (the “Administrative Agent”).
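
The whole-string round trip above works for this sample, but `encode('latin1')` raises `UnicodeEncodeError` if the text also contains characters above U+00FF, and cp1252 leaves a few bytes (such as 0x81) undefined. As a minimal sketch, assuming you only want to reinterpret the C1 range and leave everything else alone, a per-character repair could look like this. The function name `repair_c1_as_cp1252` is hypothetical, not part of any library.

```python
def repair_c1_as_cp1252(text):
    """Reinterpret C1 control characters (U+0080-U+009F) as cp1252 bytes."""
    repaired = []
    for ch in text:
        if "\x80" <= ch <= "\x9f":
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # Assumption: bytes that cp1252 leaves undefined are kept unchanged.
                pass
        repaired.append(ch)
    return "".join(repaired)


# The euro sign is above U+00FF, so a latin1 round trip would fail on this string.
sample = "the \x93Credit Agreement\x94 costs \u20ac5"
print(repair_c1_as_cp1252(sample))
```

If the data can be re-read from its original bytes, decoding those bytes as cp1252 in the first place is likely the cleaner fix; this helper is only for strings that have already been mis-decoded.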
    

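    Note the \xa0 code point is U+00A0 (NO-BREAK SPACE). It has a compatibility (not canonical) decomposition, so the compatibility forms NFKC and NFKD normalize it to a SPACE, while NFC and NFD leave it unchanged: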

    >>> ud.name('\xa0')
    'NO-BREAK SPACE'
    >>> ud.normalize('NFKC','\xa0')
    ' '
    >>> ud.name(ud.normalize('NFKC','\xa0'))
    'SPACE'
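
To make the difference between normalization forms concrete, and to show one way of getting plain ASCII quotes when that is the real goal, here is a small sketch. Normalization does not touch curly quotes, so the example maps them explicitly with `str.translate`. The names `ASCII_QUOTES` and `to_plain_text` are made up for illustration, and the mapping covers only the four common quote characters.

```python
import unicodedata as ud

# Hypothetical mapping for illustration; extend it if other punctuation matters.
ASCII_QUOTES = str.maketrans({
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
})


def to_plain_text(text):
    """Apply NFKC, then replace curly quotes with ASCII quotes."""
    return ud.normalize("NFKC", text).translate(ASCII_QUOTES)


# NFC (canonical) keeps the no-break space; NFKC (compatibility) replaces it.
print(ascii(ud.normalize("NFC", "\xa0")))
print(ascii(ud.normalize("NFKC", "\xa0")))

# Curly quotes survive NFKC, so they are translated in a separate step.
print(to_plain_text("May\xa07, 2010 (the \u201cBorrower\u201d)"))
```

Whether replacing typographic quotes is desirable depends on what the cleaned text is used for; for display purposes the properly decoded curly quotes are usually fine as they are.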
    

    A no-break space also prints correctly without normalization:

    >>> print('hello\xa0there')
    hello there
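
Printing the same way does not mean the strings are equal, which is often why normalization is still worth doing before comparing or tokenizing text. The following sketch, using a made-up variable name `nbsp_text`, shows where a no-break space behaves like a regular space and where it does not.

```python
nbsp_text = "hello\xa0there"

# Equality compares code points, not appearance on screen.
same_as_space = nbsp_text == "hello there"

# Python treats U+00A0 as whitespace, so a bare split() breaks on it...
is_whitespace = "\xa0".isspace()
split_default = nbsp_text.split()

# ...but splitting on an explicit ASCII space does not.
split_on_space = nbsp_text.split(" ")

print(same_as_space, is_whitespace)
print(split_default, split_on_space)
```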