I'm trying to list all the subcategories and pages that belong to the category page "Category:Class-based programming languages", found at:
https://en.wikipedia.org/wiki/Category:Class-based_programming_languages
I've figured out how to get this with plain URLs through the MediaWiki API's categorymembers list. The first URL below returns all members of the category, and the second returns only the subcategories:
en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500
en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500&cmtype=subcat
However, I can't find a way to accomplish this using Python. Can anyone help me out here?
This is for independent study. I've spent a lot of time on it but just can't seem to figure it out. Also, the use of Beautiful Soup is prohibited. Thank you for all the help!
OK, so after doing more research, I was able to answer my own question. Using the libraries urllib.request and json, I requested the Wikipedia API URL, parsed the JSON response, and printed the category members from it. Here's the code I used to get the subcategories:
import json
import urllib.request

# Request only the subcategories (cmtype=subcat) as JSON.
pages = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500&cmtype=subcat")
data = json.load(pages)
# The members are listed under query -> categorymembers.
query = data['query']
category = query['categorymembers']
for x in category:
    print(x['title'])
You can do the same thing for the pages in the category by changing cmtype=subcat to cmtype=page. Thanks to Nemo for trying to help me out!
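For anyone who wants both lists from one helper, here is a rough sketch that wraps the same request in a function and takes the member type as an argument. The names build_url, fetch_json and list_members are my own, not part of any library, and the User-Agent string is just a placeholder I made up. It also follows the API's "continue" / "cmcontinue" values, which I'm assuming only matters when a category has more members than cmlimit allows in one response.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_url(category, member_type, cmcontinue=None):
    """Build a categorymembers query URL (hypothetical helper).

    member_type is passed as cmtype, e.g. "page" or "subcat".
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": member_type,
        "cmlimit": "500",
        "format": "json",
    }
    if cmcontinue:
        # Resume from where the previous response stopped.
        params["cmcontinue"] = cmcontinue
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Download a URL and parse the body as JSON."""
    # Placeholder User-Agent (an assumption); replace it with your own.
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-lister-example/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_members(category, member_type, fetch=fetch_json):
    """Return the titles of all members of the given type."""
    titles = []
    cmcontinue = None
    while True:
        data = fetch(build_url(category, member_type, cmcontinue))
        titles.extend(m["title"] for m in data["query"]["categorymembers"])
        # A "continue" block means there are more results to request.
        cmcontinue = data.get("continue", {}).get("cmcontinue")
        if not cmcontinue:
            return titles
```

Calling list_members("Category:Class-based programming languages", "page") should give the page titles, and passing "subcat" instead should give the subcategories. The fetch argument is only there so the download step can be swapped out, for example when testing without a network connection.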