Search code examples
htmlencodinglinguistics

What kind of encoding is this html encoding?


I am doing a project that involves searching words in the Arabic script on Wiktionary, and when I do a GET request on certain word pages, I get something like this for example: title="\xd8\xb1\xd8\xa3\xd8\xb3\xd9\x85\xd8\xa7\xd9\x84\xd9\x8a\xd8\xa9">\xd8\xb1\xd8\xa3\xd8\xb3\xd9\x85\xd8\xa7\xd9\x84\xd9\x8a\xd8\xa9</a></li>\n<li><a href="/wiki/%D8%B1%D8%A3%D8%B3%D9%8A"

This corresponds to the following URL: https://en.wiktionary.org/wiki/%D8%B1%D8%A3%D8%B3%D9%8A.

Does anyone know what the \xd8 or %D8 encodings are called? I want to say they are hex codes, but I have already looked up hex codes for the Arabic script and they certainly are not these.


Solution

  • The percentages you see in the url are used to substitute characters that are'nt allowed in URLs, such as special characters like "/", ":" and "&" and non ASCII characters. This is called percent encoding - https://en.m.wikipedia.org/wiki/Percent-encoding

    The "\xd.." prefixed represent hexadecimal character codes, since arabic characters fall outside of UTF-8 thats how that have to be represented. Thats assuming that HTML you showed used UTF-8 encoding.