
Easy way to export Wikipedia's translated titles


Is there an easy way to export Wikipedia's translated titles to get a mapping like this:
russian_title -> english_title?

I tried to extract them from ruwiki-latest-pages-meta-current.xml.bz2 and ruwiki-latest-pages-articles.xml.bz2; however, I got fewer than 25k translations.

I found out that some links are simply not present. E.g. the Russian article on Yandex shows a link to the English wiki, but there is no [[en:Yandex]] interwiki link in the dump.

Maybe I should try to parse the English Wikipedia instead, but I'm sure there is a nicer solution.

BTW, I'm using wikixmlj, and I also tried to find en:Yandex in the dump with grep.

UPD: the dumps from @svick's solution can be found at http://dumps.wikimedia.org/[language code]wiki/latest/, e.g. http://dumps.wikimedia.org/ruwiki/latest/


Solution

  • Most of the links between Wikipedia articles in various languages are now stored on Wikidata. So, if you wanted to get to the source, you could download the dump of Wikidata and parse that (it's in JSON); a rough parsing sketch is included at the end of this answer.

    But I think a better way would be to use the dump of the langlinks table. It contains exactly the information you want, both for links that come from Wikidata and for links that are still stored in the old form.

    This dump is in SQL format. You can import it into a MySQL database, or you can parse it directly (I have written a .Net library that does that). A Python sketch of the direct-parsing approach is shown below.

    The table contains mappings from page ids of your wiki (in your case the Russian Wikipedia) to page titles in other wikis. This means you will need the page ids of the pages you're interested in. For a small number of pages, you can look them up manually using the “Page information” link, or you can use the API (see the last sketch below). But if you need this for a large number of pages, you should download the dump of the page table, which contains this mapping; the second sketch below joins the page and langlinks dumps to produce the russian_title -> english_title pairs.
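
    Roughly, parsing the Wikidata dump could look like this. This is a minimal sketch, not a production parser: it assumes the standard layout of the JSON dump (a large JSON array with one entity per line) and its documented "sitelinks" structure; the file name is just an example.

        # Extract russian_title -> english_title pairs from a Wikidata JSON dump.
        import bz2
        import json

        def ru_to_en_titles(dump_path="wikidata-latest-all.json.bz2"):
            """Yield (russian_title, english_title) pairs from the sitelinks."""
            with bz2.open(dump_path, "rt", encoding="utf-8") as f:
                for line in f:
                    line = line.strip().rstrip(",")
                    if not line or line in ("[", "]"):
                        continue  # skip empty lines and the array brackets
                    entity = json.loads(line)
                    sitelinks = entity.get("sitelinks", {})
                    if "ruwiki" in sitelinks and "enwiki" in sitelinks:
                        yield sitelinks["ruwiki"]["title"], sitelinks["enwiki"]["title"]

        if __name__ == "__main__":
            for ru, en in ru_to_en_titles():
                print(f"{ru} -> {en}")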
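
    If you go the langlinks route, the SQL dumps can also be parsed directly with regular expressions instead of being imported into MySQL. The sketch below assumes the usual dump file names and that page_id, page_namespace and page_title are the first three columns of the page table; titles with unusual escaping may need a more careful parser.

        # Join the page dump (page_id -> title) with the langlinks dump
        # (page_id -> English title) to get russian_title -> english_title.
        import gzip
        import re

        PAGE_ROW = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)'")
        LANGLINK_ROW = re.compile(r"\((\d+),'([^']*)','((?:[^'\\]|\\.)*)'\)")

        def load_page_titles(page_dump="ruwiki-latest-page.sql.gz"):
            """Map page_id -> title for main-namespace (namespace 0) pages."""
            titles = {}
            with gzip.open(page_dump, "rt", encoding="utf-8", errors="replace") as f:
                for line in f:
                    if not line.startswith("INSERT INTO"):
                        continue
                    for page_id, namespace, title in PAGE_ROW.findall(line):
                        if namespace == "0":
                            titles[int(page_id)] = title.replace("_", " ")
            return titles

        def ru_to_en(titles, langlinks_dump="ruwiki-latest-langlinks.sql.gz"):
            """Yield (russian_title, english_title) pairs from the langlinks table."""
            with gzip.open(langlinks_dump, "rt", encoding="utf-8", errors="replace") as f:
                for line in f:
                    if not line.startswith("INSERT INTO"):
                        continue
                    for ll_from, ll_lang, ll_title in LANGLINK_ROW.findall(line):
                        if ll_lang == "en" and int(ll_from) in titles:
                            yield titles[int(ll_from)], ll_title

        if __name__ == "__main__":
            page_titles = load_page_titles()
            for ru, en in ru_to_en(page_titles):
                print(f"{ru} -> {en}")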
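
    For the page-id lookups, a small script against the MediaWiki API is enough for a handful of titles (the page table dump is the better option for a large batch). The endpoint and parameters below are the standard ones; the User-Agent string and the example title are just placeholders.

        # Look up page ids for a few titles on the Russian Wikipedia.
        import json
        import urllib.parse
        import urllib.request

        def page_ids(titles, endpoint="https://ru.wikipedia.org/w/api.php"):
            """Return a dict of title -> page id (up to 50 titles per request)."""
            params = urllib.parse.urlencode({
                "action": "query",
                "titles": "|".join(titles),
                "format": "json",
                "formatversion": "2",
            })
            req = urllib.request.Request(
                f"{endpoint}?{params}",
                headers={"User-Agent": "title-export-example/0.1"},
            )
            with urllib.request.urlopen(req) as resp:
                data = json.load(resp)
            return {p["title"]: p["pageid"]
                    for p in data["query"]["pages"] if "pageid" in p}

        if __name__ == "__main__":
            print(page_ids(["Яндекс"]))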