Problem accessing Wikipedia page using Python API


I want to extract image URLs from Wikipedia pages, and I'm using the Wikipedia Python API for that. I'm having problems accessing some pages and I cannot understand what is wrong.

I'm using the Apple Inc. Wikipedia page as an example. The page title is Apple Inc. and the page name in the URL is Apple_Inc. (maybe there's a better name for that).

If I use the wikipedia.page() function to access the page with the title Apple Inc., I get the error: Page id "apple in" does not match any pages. Try another id!. The same happens if I use the title Apple_Inc. instead. But if I use something close to the title, the API often gives me the correct page: <WikipediaPage 'Apple Inc.'>. See the code below and the resulting page or error:

import wikipedia

page = wikipedia.page(title="Apple Inc.")
print(page)
# -> wikipedia.exceptions.PageError: Page id "apple in" does not match any pages. Try another id!

page = wikipedia.page(title="Apple_Inc.")
print(page)
# -> wikipedia.exceptions.PageError: Page id "apple in" does not match any pages. Try another id!

page = wikipedia.page(title="Apple Inc")
print(page)
# -> <WikipediaPage 'Apple Inc.'>

page = wikipedia.page(title="Apple In")
print(page)
# -> <WikipediaPage 'Apple Inc.'>

page = wikipedia.page(title="Apple Incorporated")
print(page)
# -> <WikipediaPage 'Apple Inc.'>

page = wikipedia.page(title="Apple Incorporated.")
print(page)
# -> <WikipediaPage 'Apple Inc.'>

page = wikipedia.page(title="Apple Incc.")
print(page)
# -> wikipedia.exceptions.PageError: Page id "apple inch" does not match any pages. Try another id!

At first, I thought the "." in "Apple Inc." was causing the problem, but the title "Apple Incorporated." works fine. And strangely, if I use the title "Apple Incc.", it seems that the API looks for the page "apple inch" for some reason.


Solution

  • According to the documentation, the suggestion is the first thing selected as the title. If it is None, the first result's title is selected instead. After that, a WikipediaPage object is created and returned.

    try:
        title = suggestion or results[0]
    except IndexError:
        # if there is no suggestion or search results, the page doesn't exist
        raise PageError(title)
    return WikipediaPage(title, redirect=redirect, preload=preload)
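
The fallback rule in the snippet above can be tried in isolation. This is a minimal sketch, not the library's code: pick_title is a hypothetical helper, LookupError stands in for the library's PageError, and the inputs are the values reported in the question.

```python
def pick_title(suggestion, results):
    """Mimic the quoted rule: prefer the suggestion, else the first search result."""
    try:
        return suggestion or results[0]
    except IndexError:
        # No suggestion and no search results: there is nothing to load.
        raise LookupError("no matching page") from None


# With a suggestion present, the real search result is ignored.
print(pick_title("apple in", ["Apple Inc."]))  # -> apple in
# Without a suggestion, the first search result wins.
print(pick_title(None, ["Apple Inc."]))        # -> Apple Inc.
```

Because any non-empty suggestion takes priority, a poor suggestion hides a perfectly good search result.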
    

    If we look deeper in the documentation, we will see that if there is no searchinfo in the received JSON, the suggestion will be None. But if there is a searchinfo in the JSON, its suggestion is selected and returned.

    search_results = (d['title'] for d in raw_results['query']['search'])

    if suggestion:
        if raw_results['query'].get('searchinfo'):
            return list(search_results), raw_results['query']['searchinfo']['suggestion']
        else:
            return list(search_results), None

    return list(search_results)
    

    You might wonder what the received JSON looks like. It looks like the JSON below:

    {
      "warnings": {
        "main": {
          "*": "Unrecognized parameter: limit."
        }
      },
      "batchcomplete": "",
      "continue": {
        "sroffset": 1,
        "continue": "-||"
      },
      "query": {
        "searchinfo": {
          "totalhits": 23251,
          "suggestion": "apple in",
          "suggestionsnippet": "apple in"
        },
        "search": [
          {
            "ns": 0,
            "title": "Apple Inc.",
            "pageid": 856
          }
        ]
      }
    }
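
To make the two competing values in that response visible, here is a small sketch that parses a trimmed copy of it. split_response is a hypothetical helper written for illustration; it only reads the query.search and query.searchinfo fields shown above and is not the library's implementation.

```python
import json

# Trimmed copy of the response shown above.
raw = json.loads("""
{
  "query": {
    "searchinfo": {"totalhits": 23251, "suggestion": "apple in"},
    "search": [{"ns": 0, "title": "Apple Inc.", "pageid": 856}]
  }
}
""")


def split_response(raw_results):
    """Return (titles, suggestion); suggestion is None when the API offers none."""
    query = raw_results["query"]
    titles = [hit["title"] for hit in query["search"]]
    suggestion = query.get("searchinfo", {}).get("suggestion")
    return titles, suggestion


titles, suggestion = split_response(raw)
print(titles)      # -> ['Apple Inc.']
print(suggestion)  # -> apple in
```

The search result is the page we want, while the suggestion is a lowercase phrase that is not a page title.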
    

    Now that we know all this, it's time to answer the original problem. For the query Apple Inc., Wikipedia suggests "apple in", the library uses that suggestion as the title, and no page has that title. All you need to do is pass the argument auto_suggest=False.

    import wikipedia
    
    page = wikipedia.page(title="Apple Inc.", auto_suggest=False)
    print(page)
    # you'll see <WikipediaPage 'Apple Inc.'>
    

    According to the documentation, auto_suggest means "let Wikipedia find a valid page title for the query". The parameters of wikipedia.page() are:

    • title - the title of the page to load
    • pageid - the numeric pageid of the page to load
    • auto_suggest - let Wikipedia find a valid page title for the query
    • redirect - allow redirection without raising RedirectError
    • preload - load content, summary, images, references, and links during initialization
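
Coming back to the original goal of collecting image URLs, the fix can be wrapped in a small helper. This is a sketch under assumptions: image_urls is a hypothetical name, and the page function is passed in as an argument (rather than imported inside the helper) so the code can be exercised without network access. It relies only on wikipedia.page and the images attribute of the returned page.

```python
def image_urls(title, page_fn):
    """Load the page with exactly this title and return its image URLs.

    page_fn is expected to behave like wikipedia.page; it is a parameter
    here (an assumption of this sketch) so a stand-in can be used offline.
    """
    # auto_suggest=False stops the library from replacing the title with
    # Wikipedia's search suggestion ("apple in" in the question).
    page = page_fn(title=title, auto_suggest=False)
    return list(page.images)


# Usage (needs network access and `pip install wikipedia`):
#     import wikipedia
#     print(image_urls("Apple Inc.", wikipedia.page))
```

With auto_suggest disabled the title has to be exact, so a misspelled title raises PageError instead of being silently corrected.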