wikipedia-api mediawiki-api wikidata wikimedia wikimedia-dumps

Finding Interlanguage Related Articles from Wiki Dump


Finding the full list of English Wikipedia articles together with their corresponding articles in other languages, such as French and Spanish, is a problem with no clear answer. There are some similar questions, but most of them refer to Wikipedia's previous structure, and the rest were left without a correct answer.

We can download the dump files of the English and Spanish Wikipedia articles from here: English Wiki and Spanish Wiki.

Both enwiki and eswiki include a table named langlinks (aka sitelinks) that is meant for finding interlanguage related articles. But it's not clear how to use it to find the interlingual counterparts (the Spanish article related to each English one). The langlinks schema looks like this:

CREATE TABLE `langlinks` (
  `ll_from` int(10) unsigned NOT NULL DEFAULT '0',
  `ll_lang` varbinary(20) NOT NULL DEFAULT '',
  `ll_title` varbinary(255) NOT NULL DEFAULT '',
  UNIQUE KEY `ll_from` (`ll_from`,`ll_lang`),
  KEY `ll_lang` (`ll_lang`,`ll_title`)
) ENGINE=InnoDB DEFAULT CHARSET=binary;

Is a record with a particular 'll_from' value in the English dump related to the record with the same 'll_from' value in the Spanish dump? If yes, why can't I find records with matching ll_from values in these two langlinks files?

Again, how can I use these langlinks files to find interlanguage related articles? I don't want to use other tools like the Wikidata Toolkit.


Solution

  • This page is helpful: Manual:langlinks table

    Fields:

    ll_from: page_id of the referring page.

    ll_lang: Language code of the target, in the ISO 639-1 standard.

    ll_title: Title of the target, including namespace (FULLPAGENAMEE style).

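    As the schema shows, the combination of ll_from and ll_lang is unique (the UNIQUE KEY), so each page has at most one link per target language. The (ll_lang, ll_title) key is an ordinary, non-unique index.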
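
    Since ll_from is a page_id of the referring page, it is local to the wiki whose dump you loaded. That would explain why the ll_from values in the enwiki and eswiki langlinks files do not line up: the same number refers to unrelated pages in the two wikis. A rough sketch of how the English-to-Spanish mapping could be built from the enwiki dump alone is below. It assumes you have also loaded enwiki's page table (columns page_id, page_namespace, page_title) next to langlinks. The SQLite in-memory database, the helper names build_demo_db and english_to_spanish, and the sample rows are all made up for illustration; with the real dumps you would run the same join in MySQL/MariaDB.

```python
import sqlite3


def build_demo_db():
    """Create a toy stand-in for enwiki's page and langlinks tables (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_namespace INTEGER NOT NULL,
            page_title TEXT NOT NULL
        );
        CREATE TABLE langlinks (
            ll_from INTEGER NOT NULL,
            ll_lang TEXT NOT NULL,
            ll_title TEXT NOT NULL,
            UNIQUE (ll_from, ll_lang)
        );
        """
    )
    conn.executemany(
        "INSERT INTO page VALUES (?, ?, ?)",
        [(10, 0, "Cat"), (11, 0, "Dog"), (12, 0, "Orphan_page")],
    )
    conn.executemany(
        "INSERT INTO langlinks VALUES (?, ?, ?)",
        [(10, "es", "Gato"), (10, "fr", "Chat"), (11, "es", "Perro")],
    )
    return conn


def english_to_spanish(conn):
    """Return {English title: Spanish title} for main-namespace pages."""
    rows = conn.execute(
        """
        SELECT p.page_title, l.ll_title
        FROM page AS p
        JOIN langlinks AS l ON l.ll_from = p.page_id  -- ll_from is an enwiki page_id
        WHERE l.ll_lang = 'es' AND p.page_namespace = 0
        """
    ).fetchall()
    return dict(rows)


mapping = english_to_spanish(build_demo_db())
print(mapping)
```

    Pages without a Spanish counterpart (Orphan_page in the toy data) simply drop out of the inner join. The eswiki langlinks file is not needed for this direction; it would only be needed to go from Spanish pages to English titles, joined against eswiki's own page table.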