I'm trying to get all the outgoing links from a given Wikipedia page to other Wikipedia articles, together with all of their respective categories.
Somehow, many pages are returned without categories even though they clearly belong to some. The problem doesn't even seem to be systematic, i.e. the pages returned without categories are not always the same.
The following example is as minimal as I can make it:
# -*- coding: utf-8 -*-
import urllib.request
import urllib.parse
import json

def link_request(more_parameters={"continue": ""}):
    parameters = {"format": "json",
                  "action": "query",
                  "generator": "links",
                  "gpllimit": "max",
                  "gplnamespace": "0",
                  "prop": "categories",
                  "cllimit": "max",
                  "titles": urllib.parse.quote(start_page.encode("utf8"))}
    parameters.update(more_parameters)
    queryString = "&".join("%s=%s" % (k, v) for k, v in parameters.items())
    # This ensures that redirects are followed automatically, documented here:
    # http://www.mediawiki.org/wiki/API:Query#Resolving_redirects
    queryString = queryString + "&redirects"
    url = "http://%s.wikipedia.org/w/api.php?%s" % (wikipedia_language, queryString)
    print(url)
    # Get JSON data from the Wikimedia API and make a dictionary out of it:
    request = urllib.request.urlopen(url)
    encoding = request.headers.get_content_charset()
    jsonData = request.read().decode(encoding)
    data = json.loads(jsonData)
    return data

def get_link_data():
    data = link_request()
    query_result = data['query']['pages']
    while 'continue' in data:
        continue_dict = dict()
        for key in data['continue']:
            if key == 'continue':
                continue_dict[key] = data['continue'][key]
            else:
                continue_dict[key] = "|".join(urllib.parse.quote(e) for e in data['continue'][key].split('|'))
        data = link_request(continue_dict)
        query_result.update(data['query']['pages'])
    print(json.dumps(query_result, indent=4))

start_page = "Albert Einstein"
wikipedia_language = "en"
get_link_data()
In case someone is wondering: the 'continue' handling above is explained here: http://www.mediawiki.org/wiki/API:Query#Continuing_queries
The problem is that, because of the way continuations work, you can't just update() the result and expect it to work.
For example, imagine you had the following linked pages with categories (the names here are made up for illustration):

- Page 1: categories A and B
- Page 2: categories C and D
- Page 3: categories E and F
Now, if you set both gpllimit and cllimit to 2 (i.e. each response will contain at most two pages and at most two categories in total), the result will be spread across three continue responses like this:

- Response 1: Page 1 with categories A and B; Page 2 with no categories
- Response 2: Page 1 with no categories; Page 2 with categories C and D
- Response 3: Page 3 with categories E and F
If you're going to use update() to combine these responses, the entries from response 2 will overwrite the entries from response 1:

- Page 1: no categories
- Page 2: categories C and D
- Page 3: categories E and F
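To make the overwrite concrete, here is a minimal standalone sketch; the page IDs and dictionary contents are made up, but they follow the shape of the pages objects the API returns:

response1_pages = {
    "123": {"pageid": 123, "title": "Page 1",
            "categories": [{"title": "Category:A"}, {"title": "Category:B"}]},
    "456": {"pageid": 456, "title": "Page 2"},  # categories arrive in the next response
}
response2_pages = {
    "123": {"pageid": 123, "title": "Page 1"},  # no categories in this chunk
    "456": {"pageid": 456, "title": "Page 2",
            "categories": [{"title": "Category:C"}, {"title": "Category:D"}]},
}

merged = dict(response1_pages)
merged.update(response2_pages)
# Page 1's entry from response 2 replaced the one from response 1,
# so its categories are lost:
print("categories" in merged["123"])  # prints False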
So, what you need to do is use a smarter approach to combine the responses: merge each page's category list into the accumulated result instead of replacing whole page entries, as sketched below. Or, even better, use one of the existing libraries to access the API.
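A minimal sketch of such a merge, assuming the same response shape as the script above (pages keyed by page ID, each with an optional categories list); merge_pages is a hypothetical helper, not part of the API:

def merge_pages(result, new_pages):
    # Fold one response's 'pages' dict into the accumulated result,
    # extending category lists instead of replacing whole page entries.
    for pageid, page in new_pages.items():
        if pageid not in result:
            result[pageid] = page
        else:
            result[pageid].setdefault("categories", []).extend(
                page.get("categories", []))

In get_link_data(), you would then call merge_pages(query_result, data['query']['pages']) in place of query_result.update(...).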