When I copy and paste the URL of this Wikipedia article, it looks like this:
http://en.wikipedia.org/wiki/Gruy%C3%A8re_%28cheese%29
However, if you paste it back into the browser's address bar, the percent signs disappear and what appear to be Unicode characters (and maybe special URL characters) take their place.
Are these abbreviations for Unicode and special URL characters?
I'm used to seeing \u00ff, etc., in JavaScript.
The reference you're looking for is RFC 3987: Internationalized Resource Identifiers, specifically the section on mapping IRIs to URIs.
RFC 3986: Uniform Resource Identifiers specifies that reserved characters and characters outside its allowed set must be percent-encoded, but it also restricts URIs themselves to US-ASCII, which does not include characters such as è.
RFC 3987 specifies that non-ASCII characters should first be encoded as UTF-8 so they can be percent-encoded as per RFC 3986. If you'll permit me to illustrate in Python:
>>> 'è'.encode('utf-8')
b'\xc3\xa8'
Here I've asked Python to encode the Unicode character è to a sequence of bytes using UTF-8. The bytes returned are 0xc3 and 0xa8. Percent-encoded, this looks like %C3%A8.
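In practice you seldom do the two steps by hand. As a minimal sketch (assuming a Python 3 interpreter), the standard library's urllib.parse.quote performs the UTF-8 encoding and the percent-encoding in one call:
>>> from urllib.parse import quote
>>> quote('è')
'%C3%A8'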
The parentheses also appearing in your URL do fit in US-ASCII, so they are percent-escaped with their US-ASCII code points, which are also valid UTF-8.
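Continuing the same interpreter session, quoting the whole article title shows both effects at once: è becomes %C3%A8, and the parentheses, which are not in quote's default set of safe characters, become their ASCII code points %28 and %29, reproducing the path in your URL.
>>> quote('Gruyère_(cheese)')
'Gruy%C3%A8re_%28cheese%29'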
So, no, there is no simple 16×16 table—such a table could never represent the richness of Unicode. But there is a method to the apparent madness.
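As for what you saw when pasting the URL back into the address bar: modern browsers typically apply the reverse mapping for display, turning the percent-escapes back into UTF-8 bytes and those bytes back into characters. A sketch of that direction with urllib.parse.unquote:
>>> from urllib.parse import unquote
>>> unquote('Gruy%C3%A8re_%28cheese%29')
'Gruyère_(cheese)'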