Tags: wikidump, mediawiki-api

How to export dumps from a Wiki


I have been looking into how to crawl some wikis (namely https://fr.vikidia.org/ and https://fr.wikimini.org/) to build a plain-text corpus for NLP.

As far as I understand, for Wikipedia this is usually done by downloading dumps from https://dumps.wikimedia.org/ and running a parser tool such as WikiExtractor over them, but it seems that these two wikis are not available on the dump site — is that right?

Following MediaWiki's Help:Export page, I have found two partial answers so far:

1) Configure MediaWiki API access for these wikis and use the listpages.py script with the -search option

Problem: I get the content of 10,000 pages at once, saved to one file per article, but the content is raw wikitext with templates rather than XML. That makes it unusable for WikiExtractor, so I could not get plain text this way.
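
For what it's worth, listpages.py appears to be Pywikibot's script, and what it retrieves is page wikitext rather than an XML dump. A minimal sketch of the same operation through the Pywikibot Python API (assuming Pywikibot has already been configured for the target wiki, e.g. with generate_family_file.py; the total=10 cap is just for illustration) shows why the output is templated wikitext:

```python
# Sketch only: assumes Pywikibot is installed and user-config.py points at the
# target wiki (a family file generated with generate_family_file.py).
import pywikibot

site = pywikibot.Site()
for page in site.allpages(total=10):   # iterate over article pages
    wikitext = page.text               # raw wikitext, templates included
    print(page.title(), len(wikitext)) # not the XML that WikiExtractor expects
```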

2) Follow these instructions to collect the list of page titles from each wiki's Special:AllPages page, paste them into its Special:Export page, and generate an XML dump

Problem: this time I get a format that WikiExtractor parses correctly into plain text, but I would need to repeat the operation for the hundreds of Special:AllPages listings of each wiki, which is not practical at all.
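
For the record, that copy-paste step can also be scripted: Special:Export accepts a POST with a newline-separated list of titles and returns the same XML as the form. A rough sketch, where the index.php path and the example titles are assumptions to adapt to each wiki:

```python
# Rough sketch: POST a batch of page titles to Special:Export and save the XML.
# EXPORT_URL and the example titles are placeholders; adjust them to the wiki.
import requests

EXPORT_URL = "https://fr.vikidia.org/w/index.php"   # assumed path

def export_titles(titles, out_path):
    resp = requests.post(
        EXPORT_URL,
        params={"title": "Special:Export"},
        data={
            "pages": "\n".join(titles),   # newline-separated page titles
            "curonly": 1,                 # latest revision of each page only
        },
    )
    resp.raise_for_status()
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(resp.text)                # XML in the format WikiExtractor expects

export_titles(["Chat", "Chien"], "batch.xml")
```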

Do you know how I could go from these wikis to plain text?


Solution

  • Use the export API with the allpages generator: https://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=10&format=jsonfm&formatversion=2&export
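
The jsonfm in that URL is only the pretty-printed debug view; a script would use format=json and follow the continuation tokens to walk the whole wiki. Below is a minimal sketch, assuming the wikis expose api.php at the usual /w/api.php path (check each wiki's Special:Version for the real endpoint); each saved batch is a self-contained XML export that WikiExtractor can parse:

```python
# Minimal sketch: page through generator=allpages with the export flag and save
# each batch of exported XML to its own file. API_URL is an assumption; check
# the wiki's Special:Version page for the real api.php endpoint.
import os
import requests

API_URL = "https://fr.vikidia.org/w/api.php"   # assumed endpoint

def dump_wiki(api_url, out_dir, batch_size=50):
    os.makedirs(out_dir, exist_ok=True)
    session = requests.Session()
    params = {
        "action": "query",
        "generator": "allpages",
        "gaplimit": batch_size,     # can usually be raised up to 500
        "export": 1,
        "format": "json",
        "formatversion": 2,
    }
    batch = 0
    while True:
        data = session.get(api_url, params=params).json()
        export = data["query"]["export"]
        # formatversion=2 returns the XML as a plain string; fall back to the
        # older {"*": ...} wrapping just in case.
        xml = export if isinstance(export, str) else export["*"]
        with open(os.path.join(out_dir, f"batch-{batch:05d}.xml"),
                  "w", encoding="utf-8") as f:
            f.write(xml)
        batch += 1
        if "continue" not in data:
            break
        params.update(data["continue"])   # carries gapcontinue to the next request

dump_wiki(API_URL, "vikidia-fr-xml")
```

Running WikiExtractor over each batch file (or over a merge of their <page> elements) should then give the plain text you are after.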