I'm trying to get all the outgoing links from a given Wikipedia page to other Wikipedia articles, together with all of their respective categories.
Somehow, many pages are returned without categories even though they clearly belong to some. The problem doesn't even seem to be systematic, i.e. the pages returned without categories are not always the same.
The following example is as minimal as I can make it:
# -*- coding: utf-8 -*-
import urllib.request
import urllib.parse
import json

def link_request(more_parameters={"continue": ""}):
    parameters = {"format": "json",
                  "action": "query",
                  "generator": "links",
                  "gpllimit": "max",
                  "gplnamespace": "0",
                  "prop": "categories",
                  "cllimit": "max",
                  "titles": urllib.parse.quote(start_page.encode("utf8"))}
    parameters.update(more_parameters)
    queryString = "&".join("%s=%s" % (k, v) for k, v in parameters.items())
    # This ensures that redirects are followed automatically, documented here:
    # http://www.mediawiki.org/wiki/API:Query#Resolving_redirects
    queryString = queryString + "&redirects"
    url = "http://%s.wikipedia.org/w/api.php?%s" % (wikipedia_language, queryString)
    print(url)
    # Get JSON data from the Wikimedia API and make a dictionary out of it:
    request = urllib.request.urlopen(url)
    encoding = request.headers.get_content_charset()
    jsonData = request.read().decode(encoding)
    data = json.loads(jsonData)
    return data

def get_link_data():
    data = link_request()
    query_result = data['query']['pages']
    while 'continue' in data:
        continue_dict = dict()
        for key in data['continue']:
            if key == 'continue':
                continue_dict[key] = data['continue'][key]
            else:
                continue_dict[key] = "|".join(urllib.parse.quote(e) for e in data['continue'][key].split('|'))
        data = link_request(continue_dict)
        query_result.update(data['query']['pages'])
    print(json.dumps(query_result, indent=4))

start_page = "Albert Einstein"
wikipedia_language = "en"
get_link_data()
In case someone is wondering: the 'continue' handling above is explained here: http://www.mediawiki.org/wiki/API:Query#Continuing_queries
The problem is that, because of the way continuations work, you can't just update() the result and expect it to work.
For example, imagine you had the following linked pages with categories (the names here are made up for illustration):

- Page 1: categories A and B
- Page 2: categories C and D
- Page 3: categories E and F
Now, if you set both gpllimit and cllimit to 2 (i.e. each response will contain at most two pages and at most two categories in total), the result will be spread across three continue responses like this:

- Response 1: Page 1 with categories A and B; Page 2 with no categories
- Response 2: Page 1 with no categories; Page 2 with categories C and D
- Response 3: Page 3 with categories E and F
If you're going to use update() to combine these responses, the entries from response 2 will overwrite the entries from response 1:

- Page 1: no categories
- Page 2: categories C and D
- Page 3: categories E and F
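To make the overwrite concrete, here is a minimal standalone sketch; the page IDs and dictionary contents are made up, but they follow the shape of the pages objects the API returns:

response1_pages = {
    "123": {"pageid": 123, "title": "Page 1",
            "categories": [{"title": "Category:A"}, {"title": "Category:B"}]},
    "456": {"pageid": 456, "title": "Page 2"},  # categories arrive in the next response
}
response2_pages = {
    "123": {"pageid": 123, "title": "Page 1"},  # no categories in this chunk
    "456": {"pageid": 456, "title": "Page 2",
            "categories": [{"title": "Category:C"}, {"title": "Category:D"}]},
}

merged = dict(response1_pages)
merged.update(response2_pages)
# Page 1's entry from response 2 replaced the one from response 1,
# so its categories are lost:
print("categories" in merged["123"])  # prints False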
So, what you need to do is use a smarter approach to combine the responses: merge each page's category list into the accumulated result instead of replacing whole page entries, as sketched below. Or, even better, use one of the existing libraries to access the API.
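A minimal sketch of such a merge, assuming the same response shape as the script above (pages keyed by page ID, each with an optional categories list); merge_pages is a hypothetical helper, not part of the API:

def merge_pages(result, new_pages):
    # Fold one response's 'pages' dict into the accumulated result,
    # extending category lists instead of replacing whole page entries.
    for pageid, page in new_pages.items():
        if pageid not in result:
            result[pageid] = page
        else:
            result[pageid].setdefault("categories", []).extend(
                page.get("categories", []))

In get_link_data(), you would then call merge_pages(query_result, data['query']['pages']) in place of query_result.update(...).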