Tags: python, character-encoding, non-ascii-characters, cjk, unescape, string

How to correctly unquote url which is supposed to contain Japanese symbols


I have the following string, for example (which, as I realized, was built from an incorrectly encoded string):

https://ja-jp.facebook.com/%C3%A5%C2%90%C2%8D%C3%A5%C2%8F%C2%A4%C3%A5%C2%B1%E2%80%B9%C3%AF%C2%BD%C5%A0%C3%AF%C2%BD%E2%80%99%C3%A3%E2%80%9A%C2%B2%C3%A3%C6%92%C2%BC%C3%A3%C6%92%CB%86%C3%A3%E2%80%9A%C2%BF%C3%A3%C6%92%C2%AF%C3%A3%C6%92%C2%BC%C3%A3%C6%92%E2%80%BA%C3%A3%C6%92%E2%80%A0%C3%A3%C6%92%C2%AB-219123305237478

A browser decodes this URL properly, showing the following:

https://ja-jp.facebook.com/名古屋jrゲートタワーホテル-219123305237478/

Is there a way to unquote/decode the string so that it's not presented like this:

https://ja-jp.facebook.com/åå¤å±‹ï½Šï½’ゲートタワーホテル-219123305237478

The browser initially shows the URL with the same rubbish for a short time, but then, without a redirect, it adjusts the string so it looks fine.
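For reference, the rubbish is classic mojibake: the percent-escapes are the UTF-8 bytes of a string that was itself produced by decoding UTF-8 bytes as Windows-1252 (cp1252). A minimal sketch of that diagnosis on one kanji taken from the URL above (the codec chain is my assumption, inferred from the byte patterns):

```python
from urllib.parse import unquote

# Percent-decoding the path (UTF-8, the default) yields the mojibake text.
path = "%C3%A5%C2%B1%E2%80%B9"          # three percent-escaped UTF-8 sequences
mojibake = unquote(path)                 # 'å±‹'

# Re-encoding those characters as cp1252 recovers the original UTF-8 bytes
# (E5 B1 8B), which then decode to the intended kanji.
original = mojibake.encode("cp1252").decode("utf-8")
print(original)                          # → 屋
```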

I'm trying to fix the encoding with this simple brute-force code:

from encodings.aliases import aliases  # all known codec aliases

def fix_encoding(s):
    # Brute force: try every encode/decode codec pair and
    # print every combination that succeeds.
    for a in aliases:
        for b in aliases:
            try:
                fixed = s.encode(a).decode(b)
            except Exception:
                continue
            print(a, b)
            print(fixed)

fix_encoding(u'åå¤å±‹ï½Šï½’ゲートタワーホテル-219123305237478')

The best results I've got are pretty close to what it should look like, but the first two symbols are wrong in all of them. For example:

��屋jrゲートタワーホテル-219123305237478
('1252', 'l8')
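The first two symbols fail because 名 (E5 90 8D) and 古 (E5 8F A4) contain bytes (0x90, 0x8D, 0x8F) that cp1252 leaves unmapped; a lenient decoder passes them through as invisible C1 control characters, and those were lost when the mojibake was copied around as text. Working from the raw percent-encoded URL instead, with a fallback error handler for those characters, recovers the whole name. A sketch under that assumption (the handler name is mine):

```python
import codecs
from urllib.parse import unquote

def cp1252_bytes(exc):
    # Fallback for C1 controls (U+0080..U+009F) that cp1252 cannot encode:
    # emit the code point as a raw byte, mirroring what a lenient cp1252
    # decoder produced in the first place.
    chunk = exc.object[exc.start:exc.end]
    return bytes(ord(c) for c in chunk), exc.end

codecs.register_error("cp1252_fallback", cp1252_bytes)

def unmojibake(url_path):
    mojibake = unquote(url_path)  # percent-decode as UTF-8
    return mojibake.encode("cp1252", errors="cp1252_fallback").decode("utf-8")

print(unmojibake("%C3%A5%C2%90%C2%8D%C3%A5%C2%8F%C2%A4%C3%A5%C2%B1%E2%80%B9"))
# → 名古屋
```

Applied to the full path from the question, this reproduces the string the browser eventually shows, including the first two characters the brute-force loop could not recover.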

Solution

  • I was going to delete this question because it doesn't seem meaningful, but since it has received some upvotes, I'm sharing some thoughts on it.

    The URL was most probably created this way on Facebook because non-UTF-8-encoded text was copied from somewhere (or perhaps due to some past bug on Facebook). Some pages contain the correctly encoded URI in scripts near an updateURI property, which seems to be used by JS to update the URL in the browser's address bar.

    This URL was probably created automatically where possible, or maybe added manually, so that the old URL known to search engines still works. So it's most probably pointless to look for a universal way of fixing such bugs.
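If you do want to pull that updateURI value out of a fetched page, a hedged sketch with a regex follows; the JSON shape and quoting here are assumptions for illustration, and real Facebook markup may escape slashes or structure the script differently:

```python
import re

def extract_update_uri(html):
    # Hypothetical sketch: find the value of an "updateURI" property
    # inside an inline script. The property name comes from the answer
    # above; the exact serialization is an assumption.
    m = re.search(r'"updateURI"\s*:\s*"([^"]+)"', html)
    return m.group(1) if m else None

sample = '<script>var cfg = {"updateURI":"/name-219123305237478"};</script>'
print(extract_update_uri(sample))  # → /name-219123305237478
```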