Tags: mysql, dump, wikipedia, wikipedia-api

Wikipedia: dump article IDs and their categories


I would like to build a MySQL database that maps every Wikipedia article ID to its category ID (the most general category). I saw that Wikipedia publishes a full dump, plus a few others such as the links between categories. I also saw there is the MediaWiki API, but I can't manage to find the right query to send.

Nonetheless, I can't find a way to dump a big file with the article IDs and their category IDs. How should I do it, and how much data should I expect?


Solution

  • Wikipedia provides dumps of most of its data. The one you want is categorylinks.sql, which contains the list of category names (categories don't have IDs) for each article ID. You will also most likely want page.sql, which contains a map from each article ID to its title.

    To work with the dumps, you can import them into a local MySQL database (an example query is sketched below), or you can use a library that parses the dumps directly, like the one I wrote for .NET.

    But each article is usually in several categories, and there is no notion of a primary category or anything like that. So if you really want just one category for each article, you will have to decide how to pick it yourself; one naive approach is sketched below.
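    For example, once page.sql and categorylinks.sql are imported into a local MySQL database, a join like the one below lists every article ID and title together with each category it belongs to. This is only a sketch assuming the default MediaWiki schema, where cl_from holds the member page's ID and cl_to holds the category name:

        -- List each article with every category it belongs to.
        -- Assumes page.sql and categorylinks.sql were imported under
        -- their default MediaWiki table names.
        SELECT p.page_id,
               p.page_title,
               cl.cl_to AS category_name
        FROM page AS p
        JOIN categorylinks AS cl ON cl.cl_from = p.page_id
        WHERE p.page_namespace = 0;  -- namespace 0 = ordinary articles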
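    As for reducing that to a single category per article: since the dumps carry no notion of a primary category, any rule you apply is your own choice. One naive sketch, using the same assumed schema, is to keep the alphabetically first category name for each article:

        -- Arbitrary tie-break: keep the alphabetically first category
        -- for each article. "First" here is not "most general".
        SELECT cl.cl_from AS page_id,
               MIN(cl.cl_to) AS category_name
        FROM categorylinks AS cl
        JOIN page AS p ON p.page_id = cl.cl_from
        WHERE p.page_namespace = 0
        GROUP BY cl.cl_from;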