Tags: python, mediawiki, wikipedia, wikipedia-api, mediawiki-api

How to get Wikipedia data for WikiProjects?


I recently found that Wikipedia has WikiProjects, which are categorised by discipline (https://en.wikipedia.org/wiki/Category:WikiProjects_by_discipline). As shown in the link, there are 34 disciplines.

I would like to know if it is possible to get all the Wikipedia articles related to each of these disciplines.

For example, consider WikiProject Computer science. Is it possible to get all the computer-science-related Wikipedia articles using the WikiProject Computer science category? If so, are there any data dumps for it, or is there any other way to obtain this data?

I am currently using Python (specifically pywikibot and pymediawiki); however, I am happy to receive answers in other languages as well.

I am happy to provide more details if needed.


Solution

  • As I suggested, and adding to @arash's answer, you can use the Wikipedia API to get this data. Here is the documentation describing how to do that: API:Categorymembers#GET_request
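
    For example, a single GET request to list the members of the project's assessment category looks like the following. The response shape is trimmed, and the ids and titles shown are purely illustrative:

    // GET https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmlimit=500
    // Trimmed response shape (illustrative values):
    {
        "batchcomplete": "",
        "query": {
            "categorymembers": [
                { "pageid": 410, "ns": 1, "title": "Talk:Algorithm" },
                { "pageid": 5323, "ns": 1, "title": "Talk:Computer science" }
            ]
        }
    }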

    As you commented that you need to fetch the data programmatically, below is sample code in JavaScript. It fetches the first 500 titles from Category:WikiProject_Computer_science_articles and prints them as output. You can port this example to the language of your choice:

    // Importing the module
    const fetch = require('node-fetch');
    
    // URL with resources to fetch
    const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";
    
    // Fetching using 'node-fetch'
    fetch(url).then(res => res.json()).then(t => {
        // Getting the length of the returned array
        let len = t.query.categorymembers.length;
        // Iterating over all the response data
        for(let i=0;i<len;i++) {
            // Printing the names
            console.log(t.query.categorymembers[i].title);
        }
    });
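
    To run this, you need node-fetch installed first (npm install node-fetch); then execute the script with Node, e.g. node fetch-titles.js (the file name here is just an illustration).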
    

    To write the data to a file, you can do it like below:

    //Importing the modules
    const fetch = require('node-fetch');
    const fs = require('fs');
    
    //URL with resources to fetch
    const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";
    
    //Fetching using 'node-fetch'
    fetch(url).then(res => res.json()).then(t => {
        // Getting the length of the returned array
        let len = t.query.categorymembers.length;
        // Initializing an empty array
        let titles = [];
        // Iterating over all the response data
        for(let i=0;i<len;i++) {
            // Printing and collecting the names
            let title = t.query.categorymembers[i].title;
            console.log(title);
            titles[i] = title;
        }
        // Joining the array with commas, since writeFileSync expects a string rather than an array
        fs.writeFileSync('pathtotitles\\titles.txt', titles.join(','));
    });
    

    The above stores the titles comma-separated, since the JavaScript array is joined with commas before writing. If you want each title on its own line without commas, you need to do it like this:

    //Importing the modules
    const fetch = require('node-fetch');
    const fs = require('fs');
    
    //URL with resources to fetch
    const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";
    
    //Fetching using 'node-fetch'
    fetch(url).then(res => res.json()).then(t => {
        // Getting the length of the returned array
        let len = t.query.categorymembers.length;
        // Initializing an empty string
        let titles = '';
        // Iterating over all the response data
        for(let i=0;i<len;i++) {
            // Printing and collecting the names
            let title = t.query.categorymembers[i].title;
            console.log(title);
            titles += title + "\n";
        }
        fs.writeFileSync('pathtotitles\\titles.txt', titles);
    });
    

    With cmlimit we can't fetch more than 500 titles per request, so we need to use cmcontinue to check for and fetch the next pages...
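
    Concretely, while more results remain, each response carries a continue object, and its cmcontinue value is passed back on the next request. The token is an opaque string; the value below is illustrative:

    "continue": {
        "cmcontinue": "page|ABC123|456789",    // opaque token; append it as &cmcontinue= on the next request
        "continue": "-||"
    }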

    Try the code below, which fetches all the titles of a particular category, prints them, and appends the data to a file:

    //Importing the modules
    const fetch = require('node-fetch');
    const fs = require('fs');
    //URL with resources to fetch
    var url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmlimit=500";
    
    // Method to fetch one page of results and append the titles to a file
    var fetchTheData = async (url) => {
        return await fetch(url).then(res => res.json()).then(data => {
            // Getting the length of the returned array
            let len = data.query.categorymembers.length;
            // Initializing an empty string
            let titles = '';
            // Iterating over all the response data
            for(let i=0;i<len;i++) {
                // Printing and collecting the names
                let title = data.query.categorymembers[i].title;
                console.log(title);
                titles += title + "\n";
            }
            // Appending to the file
            fs.appendFileSync('pathtotitles\\titles.txt', titles);
            // Returning the continuation token, or undefined once the last page has been fetched
            return data.continue && data.continue.cmcontinue;
        });
    }
    
    // Method which constructs the next page URL and keeps fetching until no continuation token is returned
    var fetchAllPages = async (url) => {
        // Getting the first page and its continuation token
        let nextPage = await fetchTheData(url);
        while (nextPage) {
            // Constructing the next page URL with the token and sending the fetch request
            let nextURL = url + '&cmcontinue=' + encodeURIComponent(nextPage);
            console.log("=> The next page URL is : " + nextURL);
            nextPage = await fetchTheData(nextURL);
        }
        console.log("===>>> Finished Fetching...");
    }
    
    // Calling to begin extraction
    fetchAllPages(url);
    

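    One caveat worth noting: WikiProject assessment categories such as Category:WikiProject_Computer_science_articles contain the articles' talk pages, so the fetched titles come back as Talk:... entries. If you want the article titles themselves, you can strip the namespace prefix; a minimal sketch (the sample title is illustrative):

    // Converting a talk-page title to the corresponding article title
    const articleTitle = (title) => title.replace(/^Talk:/, '');
    console.log(articleTitle("Talk:Computer science")); // "Computer science"
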
    I hope it helps...