python unicode encoding python-requests

Python fails to decode EUC-JP strings with the "㎝" character


I've run into an odd encoding issue. When the pages are viewed in Chrome, they render as expected and can even be saved without issue, but when saved via requests or urllib, the resulting files are corrupt. This happens specifically on pages containing the "㎝" character, and the result is not just a single \uFFFD (�) but corruption of the characters that follow as well.

E.g.: サイズ:XL 約77㎝×約58㎝ -> サイズ:XL 約77�僉潴�58��

This was sourced from this page

My attempts with EUC-JP and similar encodings have failed, and I'm at a loss as to what the root cause might be.

Here's an example with the problematic bytes from the site:

content = b"\xa5\xb5\xa5\xa4\xa5\xba\xa1\xa7XL \xcc\xf377\xad\xd1\xa1\xdf\xcc\xf358\xad\xd1"
text = content.decode("EUC-JP")
print(text)

This should print サイズ:XL 約77㎝×約58㎝, but it throws:

Traceback (most recent call last):
  File "<pyshell#53>", line 1, in <module>
    text = content.decode("EUC-JP")
UnicodeDecodeError: 'euc_jp' codec can't decode byte 0xad in position 15: illegal multibyte sequence
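
To narrow down which bytes the codec rejects, the exception object itself can be inspected. The following is only a minimal sketch: the helper name `find_decode_failure` is hypothetical, and the "row" arithmetic assumes the offending byte is an EUC lead byte in the range 0xA1 to 0xFE.

```python
content = b"\xa5\xb5\xa5\xa4\xa5\xba\xa1\xa7XL \xcc\xf377\xad\xd1\xa1\xdf\xcc\xf358\xad\xd1"


def find_decode_failure(data, encoding):
    """Return (offset, byte) for the first undecodable byte, or None if decoding succeeds."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte the codec could not decode.
        return exc.start, data[exc.start]
    return None


failure = find_decode_failure(content, "euc_jp")
if failure is not None:
    offset, byte = failure
    # For an EUC lead byte, subtracting 0xA0 gives the row of the JIS code table.
    print(f"offset {offset}: byte 0x{byte:02x}, row {byte - 0xA0}")
```

Knowing the row makes it easier to check whether the character belongs to the base character set or to an extension of it.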

Solution

  • Looks like the actual encoding is "EUC-JISx0213" or "EUC-JIS-2004", as this code works:

    content = b"\xa5\xb5\xa5\xa4\xa5\xba\xa1\xa7XL \xcc\xf377\xad\xd1\xa1\xdf\xcc\xf358\xad\xd1"
    text = content.decode("euc_jis_2004")
    print(text)
    text = content.decode("euc_jisx0213")
    print(text)
    

    From Wikipedia on EUC-JP:

    A related and partially compatible encoding, called EUC-JISx0213 or EUC-JIS-2004, encodes JIS X 0201 and JIS X 0213

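    But "㎝" is not part of JIS X 0208 proper: its lead byte 0xAD puts it in row 13 (0xAD - 0xA0), which JIS X 0208 leaves unassigned and which vendors filled with extension characters (the NEC special characters) that JIS X 0213 later adopted. Python's "euc_jp" codec supports JIS X 0208, but apparently not that extension.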

    Note: If you just re-encode the page and save it, then the browser will not display it properly, as the page is still marked as "EUC-JP" in the meta tag.
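
One way to handle the meta tag issue from the note is to rewrite the charset declaration while re-encoding to UTF-8. This is a rough sketch, not a robust HTML rewriter: the names `decode_japanese`, `to_utf8_html` and `CHARSET_RE` are made up for illustration, and the regex assumes the page declares its encoding as `charset=EUC-JP` (quoted or unquoted) somewhere in the markup.

```python
import re

# Matches charset=EUC-JP, charset="euc-jp", charset='EUC-JP', etc. (assumed declaration form).
CHARSET_RE = re.compile(r"(charset\s*=\s*[\"']?)euc-jp", re.IGNORECASE)


def decode_japanese(data):
    """Try strict EUC-JP first, then fall back to the JIS X 0213 based codec."""
    for encoding in ("euc_jp", "euc_jis_2004"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("data is not decodable as EUC-JP or EUC-JIS-2004")


def to_utf8_html(data):
    """Decode the raw page bytes and return UTF-8 bytes with the charset declaration updated."""
    text = decode_japanese(data)
    text = CHARSET_RE.sub(r"\g<1>utf-8", text)
    return text.encode("utf-8")
```

With requests, the raw bytes would come from `response.content`, so something like `open("page.html", "wb").write(to_utf8_html(response.content))` should produce a file that the browser reads as UTF-8. Trying "euc_jp" first is a design choice based on the Wikipedia remark that the two encodings are only partially compatible.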