Tags: python, mediawiki, wikipedia, wikipedia-api, mediawiki-api

How to get Wikipedia data for WikiProjects?


I recently found that Wikipedia has WikiProjects, which are categorised by discipline (https://en.wikipedia.org/wiki/Category:WikiProjects_by_discipline). As shown in the link, there are 34 disciplines.

I would like to know if it is possible to get all the Wikipedia articles related to each of these disciplines.

For example, consider WikiProject Computer science. Is it possible to get all the computer-science-related Wikipedia articles using the WikiProject Computer science category? If so, are there any data dumps for it, or is there any other way to obtain this data?

I am currently using Python (specifically pywikibot and pymediawiki); however, I am happy to receive answers in other languages as well.

I am happy to provide more details if needed.


Solution

  • As I suggested, and adding to @arash's answer, you can use the Wikipedia API to get this data. Here is the documentation describing how to do that: API:Categorymembers#GET_request
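
    For example, a single GET request to list the members of the project's assessment category looks like the following. The response shape is trimmed, and the ids and titles shown are purely illustrative:

    // GET https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmlimit=500
    // Trimmed response shape (illustrative values):
    {
        "batchcomplete": "",
        "query": {
            "categorymembers": [
                { "pageid": 410, "ns": 1, "title": "Talk:Algorithm" },
                { "pageid": 5323, "ns": 1, "title": "Talk:Computer science" }
            ]
        }
    }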

    As you commented that you need to fetch the data programmatically, below is sample code in JavaScript. It fetches the first 500 titles from Category:WikiProject_Computer_science_articles and prints them as output. You can port this example to the language of your choice:

    // Importing the module
    const fetch = require('node-fetch');
    
    // URL with resources to fetch
    const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";
    
    // Fetching using 'node-fetch'
    fetch(url).then(res => res.json()).then(t => {
        // Getting the length of the returned array
        let len = t.query.categorymembers.length;
        // Iterating over all the response data
        for(let i=0;i<len;i++) {
            // Printing the names
            console.log(t.query.categorymembers[i].title);
        }
    });
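
    To run this, you need node-fetch installed first (npm install node-fetch); then execute the script with Node, e.g. node fetch-titles.js (the file name here is just an illustration).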
    

    To write the data to a file, you can do it like below:

    //Importing the modules
    const fetch = require('node-fetch');
    const fs = require('fs');
    
    //URL with resources to fetch
    const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";
    
    //Fetching using 'node-fetch'
    fetch(url).then(res => res.json()).then(t => {
        // Getting the length of the returned array
        let len = t.query.categorymembers.length;
        // Initializing an empty array
        let titles = [];
        // Iterating over all the response data
        for(let i=0;i<len;i++) {
            // Printing and collecting the names
            let title = t.query.categorymembers[i].title;
            console.log(title);
            titles[i] = title;
        }
        // Joining the array with commas, since writeFileSync expects a string rather than an array
        fs.writeFileSync('pathtotitles\\titles.txt', titles.join(','));
    });
    

    The above stores the titles comma-separated, since the JavaScript array is joined with commas before writing. If you want each title on its own line without commas, you need to do it like this:

    //Importing the modules
    const fetch = require('node-fetch');
    const fs = require('fs');
    
    //URL with resources to fetch
    const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";
    
    //Fetching using 'node-fetch'
    fetch(url).then(res => res.json()).then(t => {
        // Getting the length of the returned array
        let len = t.query.categorymembers.length;
        // Initializing an empty string
        let titles = '';
        // Iterating over all the response data
        for(let i=0;i<len;i++) {
            // Printing and collecting the names
            let title = t.query.categorymembers[i].title;
            console.log(title);
            titles += title + "\n";
        }
        fs.writeFileSync('pathtotitles\\titles.txt', titles);
    });
    

    With cmlimit we can't fetch more than 500 titles per request, so we need to use cmcontinue to check for and fetch the next pages...
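
    Concretely, while more results remain, each response carries a continue object, and its cmcontinue value is passed back on the next request. The token is an opaque string; the value below is illustrative:

    "continue": {
        "cmcontinue": "page|ABC123|456789",    // opaque token; append it as &cmcontinue= on the next request
        "continue": "-||"
    }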

    Try the code below, which fetches all the titles of a particular category, prints them, and appends the data to a file:

    //Importing the modules
    const fetch = require('node-fetch');
    const fs = require('fs');
    //URL with resources to fetch
    var url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmlimit=500";
    
    // Method to fetch one page of results and append the titles to a file
    var fetchTheData = async (url) => {
        return await fetch(url).then(res => res.json()).then(data => {
            // Getting the length of the returned array
            let len = data.query.categorymembers.length;
            // Initializing an empty string
            let titles = '';
            // Iterating over all the response data
            for(let i=0;i<len;i++) {
                // Printing and collecting the names
                let title = data.query.categorymembers[i].title;
                console.log(title);
                titles += title + "\n";
            }
            // Appending to the file
            fs.appendFileSync('pathtotitles\\titles.txt', titles);
            // Returning the continuation token, or undefined once the last page has been fetched
            return data.continue && data.continue.cmcontinue;
        });
    }
    
    // Method which constructs the next page URL and keeps fetching until no continuation token is returned
    var fetchAllPages = async (url) => {
        // Getting the first page and its continuation token
        let nextPage = await fetchTheData(url);
        while (nextPage) {
            // Constructing the next page URL with the token and sending the fetch request
            let nextURL = url + '&cmcontinue=' + encodeURIComponent(nextPage);
            console.log("=> The next page URL is : " + nextURL);
            nextPage = await fetchTheData(nextURL);
        }
        console.log("===>>> Finished Fetching...");
    }
    
    // Calling to begin extraction
    fetchAllPages(url);
    

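    One caveat worth noting: WikiProject assessment categories such as Category:WikiProject_Computer_science_articles contain the articles' talk pages, so the fetched titles come back as Talk:... entries. If you want the article titles themselves, you can strip the namespace prefix; a minimal sketch (the sample title is illustrative):

    // Converting a talk-page title to the corresponding article title
    const articleTitle = (title) => title.replace(/^Talk:/, '');
    console.log(articleTitle("Talk:Computer science")); // "Computer science"
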
    I hope it helps...