
How to Get the Pageids and Titles of All Wikipedia's Content Pages Through MediaWiki API?


Wikipedia Statistics

The link above shows that English Wikipedia has nearly 6 million content pages. How can I use the MediaWiki API to get the pageids and titles of all content pages?

params = {
    'action': 'query',
    'list': 'allpages',
    'gapfilterredir': 'nonredirects',
    'apnamespace': 0,
    'aplimit': 500,
    'format': 'json'
}

I have tried these API parameters. Although I set 'gapfilterredir' to 'nonredirects', the results still include some redirect pages, and the number of scraped items is much larger than 6 million.


Solution

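  • Preferably use the database dumps, but if you really want to use the API, what you have shown is essentially the right way to do it. One correction: with 'list': 'allpages' the parameter prefix is 'ap', so the filter should be 'apfilterredir'. The 'gap' prefix only applies when allpages is used as a generator, so 'gapfilterredir' is not recognized here and redirects are still returned. Even without redirects the count will not match exactly, because the statistics exclude certain very short pages (pages with no internal link or period, if I remember correctly).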
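
For the API route, a minimal sketch of the full listing loop might look like the following. It assumes the English Wikipedia endpoint at https://en.wikipedia.org/w/api.php and uses only the Python standard library. The helper names `fetch_json` and `iter_content_pages`, as well as the User-Agent string, are made up for illustration. Each response carries a `continue` object when more results remain, and its values are merged into the next request.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for English Wikipedia.
API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the API and return the decoded JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent; replace it with one that identifies your script.
    request = urllib.request.Request(
        url, headers={"User-Agent": "allpages-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_content_pages(fetch=fetch_json):
    """Yield (pageid, title) for each non-redirect page in namespace 0."""
    params = {
        "action": "query",
        "list": "allpages",
        "apfilterredir": "nonredirects",  # 'ap' prefix because allpages is a list here
        "apnamespace": 0,
        "aplimit": 500,
        "format": "json",
    }
    while True:
        data = fetch(params)
        for page in data["query"]["allpages"]:
            yield page["pageid"], page["title"]
        if "continue" not in data:
            break  # no continuation object means the listing is complete
        # Carry the continuation values (e.g. 'apcontinue') into the next request.
        params = {**params, **data["continue"]}
```

Iterating over `iter_content_pages()` walks the whole namespace in batches of up to 500 titles, so a full pass takes on the order of ten thousand requests or more; wrapping it in `itertools.islice` is a convenient way to try it on the first few results. The `fetch` argument exists only so the loop can be exercised without network access.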