python, json, web-scraping, wikipedia

How to scrape the subcategories and pages of a Wikipedia category page using Python


I'm trying to scrape all the subcategories and pages listed on the category page "Category:Class-based programming languages", found at:

https://en.wikipedia.org/wiki/Category:Class-based_programming_languages

I've figured out a way to do this using URLs and the MediaWiki API's categorymembers list. The queries would be:

  • all members: en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500
  • subcategories only: en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500&cmtype=subcat

However, I can't find a way to accomplish this using Python. Can anyone help me out here?

This is for independent study. I've spent a lot of time on it but can't seem to figure it out. Also, the use of Beautiful Soup is prohibited. Thank you for all the help!


Solution

  • OK, after more research I was able to answer my own question. Using the urllib.request and json libraries, I requested the API URL, parsed the JSON response, and printed the category members from it. Here's the code I used to get the subcategories:

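    import json
    import urllib.request

    # Ask the MediaWiki API for the subcategories (cmtype=subcat) as JSON.
    url = "https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500&cmtype=subcat"
    pages = urllib.request.urlopen(url)
    data = json.load(pages)
    # The member records are stored under query -> categorymembers.
    query = data['query']
    category = query['categorymembers']
    for x in category:
        print(x['title'])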
    

    You can do the same thing for the pages in the category by using cmtype=page instead of cmtype=subcat. Thanks to Nemo for trying to help me out!
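For anyone who wants both lists from one helper, here is a minimal sketch along the same lines, still using only urllib and json. The function names (build_url, fetch_json, category_members) and the User-Agent string are my own placeholders, not part of the API. It assumes the response carries a "continue" object when a category has more members than cmlimit allows, and passes that object back on the next request.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_url(category, cmtype, cont=None):
    """Build a categorymembers query URL (cmtype: 'page', 'subcat' or 'file')."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": cmtype,
        "cmlimit": "500",
        "format": "json",
    }
    # Continuation values returned by the previous batch, if any.
    params.update(cont or {})
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Download one URL and decode its JSON body."""
    # Placeholder User-Agent (an assumption); replace it with your own details.
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-demo/0.1 (independent study)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def category_members(category, cmtype, fetch=fetch_json):
    """Return every member title, requesting further batches while the API offers them."""
    titles = []
    cont = None
    while True:
        data = fetch(build_url(category, cmtype, cont))
        titles.extend(member["title"] for member in data["query"]["categorymembers"])
        cont = data.get("continue")
        if not cont:
            return titles
```

Calling category_members("Category:Class-based programming languages", "subcat") should give the subcategory titles, and passing "page" instead should give the pages. The fetch parameter is only there so the download step can be swapped out, for example when trying the loop without network access.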