
Why does Wikipedia use a modified percent encoding in their URL fragments?


I noticed that Wikipedia uses standard percent-encoding for the path section of a URL, but replaces each % character with a . in the #fragment.

For example, on the Russian 'Russia' page, the URL for section 2 (История) is

http://ru.wikipedia.org/wiki/%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D1%8F#.D0.98.D1.81.D1.82.D0.BE.D1.80.D0.B8.D1.8F

instead of

http://ru.wikipedia.org/wiki/%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D1%8F#%D0%98%D1%81%D1%82%D0%BE%D1%80%D0%B8%D1%8F

Neither is a valid id/name token before HTML5, since such a token must start with a letter ([A-Za-z]). HTML5 currently states that an id can be any string of at least one character that contains no spaces (so you don't need to encode at all), but Wikipedia does not use HTML5.
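To make the two forms concrete, here is a minimal sketch of the transformation as it appears from the outside: percent-encode the UTF-8 bytes, then swap each % for a dot. The name toDotFragment is hypothetical, and this is an assumption about the visible result only, not Wikipedia's actual implementation.

```javascript
// Hypothetical helper: reproduce the dot-encoded fragment shown above.
// Assumption: the scheme is plain percent-encoding with "%" replaced by ".".
function toDotFragment(title) {
  // encodeURIComponent percent-encodes the UTF-8 bytes of non-ASCII characters.
  const percentEncoded = encodeURIComponent(title);
  // Swap every "%" for "." to get the fragment form.
  return percentEncoded.replace(/%/g, ".");
}

console.log(toDotFragment("История"));
// .D0.98.D1.81.D1.82.D0.BE.D1.80.D0.B8.D1.8F
```

Note that this mapping is not reversible in general: a title that already contains a literal dot cannot be told apart from an encoded byte.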

So, why has Wikipedia used this scheme?


Solution

  • One possible answer is cross-browser problems. Browsers are inconsistent in how they handle Unicode, especially in URL fragments.

    For example, with the link

    <a id="foo" href="%D1%83%D0%BE%D0%BC%D0%B1%D0%BB%D1%8B">Уомблы</a>

    Browser      | Hover   | Location bar | href*   | path*
    ----------------------------------------------------------
    Chrome 19    | Unicode | Unicode      | Percent | Percent
    Firefox 13   | Unicode | Unicode      | Percent | Percent
    IE 9         | Percent | Percent      | Percent | Percent
    

    but with a fragment:

    <a id="foo" href="#%D1%83%D0%BE%D0%BC%D0%B1%D0%BB%D1%8B">Уомблы</a>

    Browser      | Hover   | Location bar | href*   | hash*
    ----------------------------------------------------------
    Chrome 19    | Percent | Percent      | Percent | Percent
    Firefox 13   | Unicode | Unicode      | Percent | Unicode
    IE 9         | Percent | Percent      | Percent | Percent
    

    * href = javascript:document.getElementById('foo').href

    * path = javascript:location.pathname after following link

    * hash = javascript:location.hash after following link

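    So Firefox decodes the fragment's percent-encoding to Unicode when you read location.hash, which means the value no longer matches the id/name attribute's value. A fragment that uses . instead of % contains nothing for the browser to decode, so it should read back unchanged. Note that this is only an issue in JavaScript; following links works fine.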
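The mismatch described above can be sketched without a browser by comparing the two strings that location.hash might return. This is a minimal illustration, assuming the target element's id is the percent-encoded string; normalizeFragment and hashMatchesId are hypothetical names, not part of any browser API.

```javascript
// Hypothetical helper: strip a leading "#" and decode percent-escapes,
// falling back to the raw text if the escapes are malformed.
function normalizeFragment(value) {
  const raw = value.startsWith("#") ? value.slice(1) : value;
  try {
    return decodeURIComponent(raw);
  } catch (e) {
    return raw; // malformed escape sequence: keep the text as-is
  }
}

// Compare a hash and an id after bringing both to the decoded form.
function hashMatchesId(hash, id) {
  return normalizeFragment(hash) === normalizeFragment(id);
}

// Assumed id value: the percent-encoded form used in the link above.
const id = "%D1%83%D0%BE%D0%BC%D0%B1%D0%BB%D1%8B";

// Percent-encoded hash, as reported for Chrome 19 and IE 9 in the table.
console.log(hashMatchesId("#%D1%83%D0%BE%D0%BC%D0%B1%D0%BB%D1%8B", id)); // true
// Decoded hash, as reported for Firefox 13 in the table.
console.log(hashMatchesId("#уомблы", id)); // true
// A naive string comparison fails for the decoded form.
console.log("#уомблы".slice(1) === id); // false
```

Decoding both sides is only one possible workaround, and it would over-decode an id that legitimately contains a literal percent escape. A dot-encoded fragment avoids the question entirely, since decodeURIComponent leaves a string with no % unchanged.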