Search code examples
mediawikiwikipediawikipedia-apimediawiki-api

How to obtain a list of titles of all Wikipedia articles


I'd like to obtain a list of all the titles of all Wikipedia articles. I know there are two possible ways to get content from a Wikimedia powered wiki. One would be the API and the other one would be a database dump.

I'd prefer not to download the wiki dump. First, it's huge, and second, I'm not really experienced with querying databases. The problem with the API on the other hand is that I couldn't figure out a way to only retrieve a list of the article titles and even if it would need > 4 mio requests which would probably get me blocked from any further requests anyway.

So my question is

  1. Is there a way to obtain only the titles of Wikipedia articles via the API?
  2. Is there a way to combine multiple request/queries into one? Or do I actually have to download a Wikipedia dump?

Solution

  • The allpages API module allows you to do just that. Its limit (when you set aplimit=max) is 500, so to query all 4.5M articles, you would need about 9000 requests.

    But a dump is a better choice, because there are many different dumps, including all-titles-in-ns0 which, as its name suggests, contains exactly what you want (59 MB of gzipped text).