url, wikipedia, url-encoding, mining

Determining Exact URL from a link inside wiki text


In a Wikipedia article's text, a link might be written as [Category:A B C], but the actual wiki URL ends with a suffix like Category:A_B_C. Where can I find the rules the wiki uses to derive the URL from a link in its text (e.g. converting spaces to underscores, capitalizing the first letter, handling non-ASCII characters, etc.)?


Solution

  • Roughly the following:

    • Normalize namespace, e.g. category: --> Category:.
    • Uppercase the first letter of the title proper, e.g. Category:foo --> Category:Foo. Note: this depends on wiki settings; on Wiktionary, for example, titles are never uppercased.
    • Replace spaces with underscores, e.g. Foo bar --> Foo_bar.
    • Percent-encode all the usual characters with PHP's standard function urlencode(), except for the following ones: ;:@$!*(),/.

    For full technical details, look up the functions getLocalUrl() and wfUrlencode() in the MediaWiki source.
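
    The steps above can be sketched in Python as follows. This is only a rough approximation, not MediaWiki's actual code: the function name link_to_url_suffix, the capitalize_first flag, and the small NAMESPACES table are made up for illustration (a real wiki defines its own namespace list), and urllib.parse.quote stands in for PHP's urlencode(), which differs in a few edge cases such as "~".

```python
from urllib.parse import quote

# Hypothetical subset of namespaces; a real wiki defines its own list.
NAMESPACES = {"category": "Category", "template": "Template", "help": "Help"}

# Characters the list above names as left unescaped.
UNESCAPED = ";:@$!*(),/"


def link_to_url_suffix(link, capitalize_first=True):
    """Approximate the URL suffix for a wiki link target (assumed rules)."""
    # Step 1: normalize the namespace, but only if the prefix is a known one.
    prefix, sep, rest = link.partition(":")
    namespace = NAMESPACES.get(prefix.strip().lower()) if sep else None
    title = rest.strip() if namespace else link.strip()

    # Step 2: uppercase the first letter of the title proper (wiki-dependent).
    if capitalize_first and title:
        title = title[0].upper() + title[1:]

    # Step 3: replace spaces with underscores.
    full = f"{namespace}:{title}" if namespace else title
    full = full.replace(" ", "_")

    # Step 4: percent-encode (as UTF-8) everything except the listed characters.
    return quote(full, safe=UNESCAPED)


print(link_to_url_suffix("Category:A B C"))  # Category:A_B_C
print(link_to_url_suffix("category:foo"))    # Category:Foo
```

    Passing capitalize_first=False imitates wikis such as Wiktionary, where the first letter is left untouched.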