I'm using the Wikipedia API to get page HTML which I parse. I use queries like this one to get the HTML for the first section of a page.
The MediaWiki API provides a handy parameter, redirects, which causes the API to automatically follow pages that redirect to other pages. For example, if I search for 'Cats' with https://en.wikipedia.org/w/api.php?page=Cats&redirects, I will be shown the results for Cat, because Cats redirects to Cat.
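Assembling that request in Python can be sketched like this; the endpoint and parameter names are the ones from the query above, and action=parse is an assumption about which API module is being used (the example URL omits it):

```python
from urllib.parse import urlencode

def build_url(title):
    # Sketch: build an Action API URL that follows redirects automatically.
    # action=parse is assumed; redirects=1 is the explicit form of the
    # bare "redirects" flag in the example URL above.
    params = {
        "action": "parse",
        "page": title,
        "redirects": 1,   # follow Cats -> Cat automatically
        "format": "json",
    }
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)

print(build_url("Cats"))
```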
I'd like a similar feature for disambiguation pages such as this one: if I arrive at a disambiguation page, I would be automatically redirected to the first link on it. For example, if I make a request to a page like Mercury, I'd automatically be redirected to Mercury (element), as it is the first link listed on the page.
The Python HTML parser BeautifulSoup is fairly slow on large documents. By requesting only the first section of an article with section=0 (that's all I need for my use case), I can parse it quickly. This is perfect for most articles, but for disambiguation pages the first section does not include any of the links to the specific pages, which makes it a poor solution there. Yet if I request more than the first section, the HTML takes longer to load and parse, which is unnecessary for most articles. See this query for an example of a disambiguation page whose links are not included in the first section.
As of right now, I've gotten as far as detecting when a disambiguation page is reached, using code like

bs4.BeautifulSoup(page_html, "html.parser").find("p", recursive=False).get_text().endswith(("refer to:", "refers to:"))
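Since parsing speed is the concern, the same check can be done without bs4 at all. Here is a stdlib-only sketch using html.parser that reads just the first &lt;p&gt; of the HTML and tests for the "refer to:" suffixes from the snippet above:

```python
from html.parser import HTMLParser

class FirstParagraphText(HTMLParser):
    """Collect the text of the first <p> element, then stop collecting."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # first <p> finished; ignore the rest

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)

def is_disambiguation(page_html):
    parser = FirstParagraphText()
    parser.feed(page_html)
    text = "".join(parser.parts).strip()
    return text.endswith(("refer to:", "refers to:"))

print(is_disambiguation("<p>Mercury may refer to:</p>"))  # True
```

This avoids building a full parse tree, which is most of bs4's cost on large documents.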
I also spent a while trying to write code that automatically followed a link, before I realized that the links were not included in the first section.
I'd prefer to keep the number of requests made to a minimum. I also need to be parsing as little HTML as possible, because speed is essential for my application.
I could envision several solutions to this problem:

- Using a faster HTML parser than bs4, so that it doesn't matter if I end up having to parse the entire page HTML

As Tgr and everybody said, no, such a feature doesn't exist, because it wouldn't make sense: the first link on a disambiguation page doesn't have any special status or meaning.
As for the existing API, see https://www.mediawiki.org/wiki/Extension:Disambiguator#API_usage
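Assuming the pageprops-based check that the Extension:Disambiguator page documents (a "disambiguation" page property on disambiguation pages), detection can be sketched like this; the sample response below is a trimmed, hypothetical example of that shape:

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def disambig_check_url(title):
    # prop=pageprops exposes the "disambiguation" property that
    # Extension:Disambiguator sets on disambiguation pages.
    return API + "?" + urlencode({
        "action": "query",
        "prop": "pageprops",
        "ppprop": "disambiguation",
        "titles": title,
        "format": "json",
    })

def is_disambig_page(response_json):
    pages = json.loads(response_json)["query"]["pages"]
    return any("disambiguation" in p.get("pageprops", {})
               for p in pages.values())

# Trimmed, hypothetical response of the documented shape:
sample = '{"query": {"pages": {"1": {"pageprops": {"disambiguation": ""}}}}}'
print(is_disambig_page(sample))  # True
```

This costs one extra query request but no extra HTML parsing, which may suit the speed constraint better than string-matching the lead paragraph.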
By the way, the "bot policy" you linked does not really apply to crawlers/scrapers; the only relevant policy/guideline is the User-Agent policy.