python python-3.x unicode python-unicode unicode-normalization

Some annoying characters are not normalised by unicodedata

I have a Python string, shown below, taken from an SEC filing of a US public company. I am trying to remove some annoying characters from it with the unicodedata.normalize function, but not all of them are removed. What could be the reason for this behavior?

from unicodedata import normalize
s = 'GTS.Client.Services@JPMChase.com\nFacsimile\nNo.:\xa0 312-233-2266\n\xa0\nJPMorgan Chase Bank,\nN.A., as Administrative Agent\n10 South Dearborn, Floor 7th\nIL1-0010\nChicago, IL 60603-2003\nAttention:\xa0 Hiral Patel\nFacsimile No.:\xa0 312-385-7096\n\xa0\nLadies and Gentlemen:\n\xa0\nReference is made to the\nCredit Agreement, dated as of May\xa07, 2010 (as the same may be amended,\nrestated, supplemented or otherwise modified from time to time, the \x93Credit Agreement\x94), by and among\nHawaiian Electric Industries,\xa0Inc., a Hawaii corporation (the \x93Borrower\x94), the Lenders from time to\ntime party thereto and JPMorgan Chase Bank, N.A., as issuing bank and\nadministrative agent (the \x93Administrative Agent\x94).'

normalize('NFKC', s)
'GTS.Client.Services@JPMChase.com\nFacsimile\nNo.:  312-233-2266\n \nJPMorgan Chase Bank,\nN.A., as Administrative Agent\n10 South Dearborn, Floor 7th\nIL1-0010\nChicago, IL 60603-2003\nAttention:  Hiral Patel\nFacsimile No.:  312-385-7096\n \nLadies and Gentlemen:\n \nReference is made to the\nCredit Agreement, dated as of May 7, 2010 (as the same may be amended,\nrestated, supplemented or otherwise modified from time to time, the \x93Credit Agreement\x94), by and among\nHawaiian Electric Industries, Inc., a Hawaii corporation (the \x93Borrower\x94), the Lenders from time to\ntime party thereto and JPMorgan Chase Bank, N.A., as issuing bank and\nadministrative agent (the \x93Administrative Agent\x94).'

As the output shows, \xa0 is handled properly, but characters such as \x92, \x93 and \x94 are not normalized and remain unchanged in the result string.

Solution

Your data was decoded as ISO-8859-1 (aka latin1), where the code points U+0092, U+0093 and U+0094 are control characters, so normalization has nothing to map them to. In Windows-1252 (aka cp1252) the same byte values are the so-called smart quotes:

>>> '\x92\x93\x94'.encode('latin1').decode('cp1252')
'’“”'

They also don't change when normalized, but at least they display correctly if decoded properly:

>>> import unicodedata as ud
>>> ud.normalize('NFKC','\x92\x93\x94'.encode('latin1').decode('cp1252'))
'’“”'
>>> print(s.encode('latin1').decode('cp1252'))
GTS.Client.Services@JPMChase.com
Facsimile
No.:  312-233-2266
 
JPMorgan Chase Bank,
N.A., as Administrative Agent
10 South Dearborn, Floor 7th
IL1-0010
Chicago, IL 60603-2003
Attention:  Hiral Patel
Facsimile No.:  312-385-7096
 
Ladies and Gentlemen:
 
Reference is made to the
Credit Agreement, dated as of May 7, 2010 (as the same may be amended,
restated, supplemented or otherwise modified from time to time, the “Credit Agreement”), by and among
Hawaiian Electric Industries, Inc., a Hawaii corporation (the “Borrower”), the Lenders from time to
time party thereto and JPMorgan Chase Bank, N.A., as issuing bank and
administrative agent (the “Administrative Agent”).
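
The whole-string round trip above works for this sample, but `encode('latin1')` fails if the text also contains characters above U+00FF, and `decode('cp1252')` fails on the few byte values cp1252 leaves unassigned (such as 0x81). A minimal sketch of a more tolerant variant follows, assuming only the C1 range U+0080 to U+009F needs repairing. The name `fix_c1_controls` is hypothetical, not part of any library:

```python
def fix_c1_controls(text):
    """Reinterpret C1 control characters (U+0080-U+009F) as cp1252 bytes.

    Hypothetical helper: every other character is passed through untouched.
    """
    out = []
    for ch in text:
        if '\x80' <= ch <= '\x9f':
            try:
                # Same round trip as above, applied to one character at a time.
                ch = ch.encode('latin1').decode('cp1252')
            except UnicodeDecodeError:
                # Byte value is unassigned in cp1252; keep the original character.
                pass
        out.append(ch)
    return ''.join(out)


print(fix_c1_controls('the \x93Borrower\x94'))  # the “Borrower”
```

Working character by character is slower than a single encode/decode call, but it does not raise on mixed input.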

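Note that \xa0 is U+00A0 (NO-BREAK SPACE). It has a compatibility (not canonical) decomposition to SPACE, so the compatibility forms NFKC and NFKD replace it, while NFC and NFD leave it unchanged: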

>>> ud.name('\xa0')
'NO-BREAK SPACE'
>>> ud.normalize('NFKC','\xa0')
' '
>>> ud.name(ud.normalize('NFKC','\xa0'))
'SPACE'
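
To see why the two kinds of character behave differently, the Unicode database can be queried directly. The sketch below uses only standard `unicodedata` functions; the variable names `nbsp` and `c1` are just for illustration:

```python
import unicodedata as ud

nbsp = '\xa0'
# The '<noBreak>' tag marks this as a compatibility decomposition.
print(ud.decomposition(nbsp))               # <noBreak> 0020
print(ud.normalize('NFC', nbsp) == nbsp)    # True: canonical forms keep it
print(ud.normalize('NFKC', nbsp) == ' ')    # True: compatibility forms replace it

c1 = '\x93'
# C1 controls are category Cc and have no decomposition at all,
# so no normalization form changes them.
print(ud.category(c1))                      # Cc
print(repr(ud.decomposition(c1)))           # ''
print(ud.normalize('NFKC', c1) == c1)       # True
```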

The no-break space also prints correctly without normalization:

>>> print('hello\xa0there')
hello there
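
If the goal is plain ASCII punctuation rather than correct display, the steps can be combined: repair the decoding, normalize, then map the smart quotes explicitly, since NFKC does not touch them. This is only a sketch. The `clean` function and the `ASCII_QUOTES` table are hypothetical names, and the code assumes every character in the input is at most U+00FF and maps to an assigned cp1252 byte:

```python
import unicodedata as ud

# Hypothetical mapping from curly quotes to their ASCII counterparts.
ASCII_QUOTES = str.maketrans({
    '\u2018': "'", '\u2019': "'",
    '\u201c': '"', '\u201d': '"',
})


def clean(text):
    # 1. Undo the latin1 mis-decoding (assumes input fits in latin1/cp1252).
    repaired = text.encode('latin1').decode('cp1252')
    # 2. Compatibility normalization turns NO-BREAK SPACE into SPACE.
    normalized = ud.normalize('NFKC', repaired)
    # 3. NFKC leaves smart quotes alone, so replace them explicitly.
    return normalized.translate(ASCII_QUOTES)


print(clean('the \x93Credit Agreement\x94, May\xa07'))
# the "Credit Agreement", May 7
```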