I'm using the Wikipedia API to get page HTML which I parse. I use queries like this one to get the HTML for the first section of a page.
The MediaWiki API provides a handy parameter, redirects, which causes the API to automatically follow pages that redirect to other pages. For example, if I search for 'Cats' with https://en.wikipedia.org/w/api.php?page=Cats&redirects, I will be shown the results for Cat, because Cats redirects to Cat.
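Assembling that request in Python can be sketched like this; the endpoint and parameter names are the ones from the query above, and action=parse is an assumption about which API module is being used (the example URL omits it):

```python
from urllib.parse import urlencode

def build_url(title):
    # Sketch: build an Action API URL that follows redirects automatically.
    # action=parse is assumed; redirects=1 is the explicit form of the
    # bare "redirects" flag in the example URL above.
    params = {
        "action": "parse",
        "page": title,
        "redirects": 1,   # follow Cats -> Cat automatically
        "format": "json",
    }
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)

print(build_url("Cats"))
```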
I'd like a similar feature for disambiguation pages such as this one: if I arrive at a disambiguation page, I would be automatically redirected to the first link on it. For example, if I make a request to a page like Mercury, I'd automatically be redirected to Mercury (element), as it is the first link listed on the page.
The Python HTML parser BeautifulSoup is fairly slow on large documents. By requesting only the first section of an article with section=0 (that's all I need for my use case), I can parse it quickly. This is perfect for most articles, but for disambiguation pages the first section does not include any of the links to the specific pages, which makes it a poor solution there. Yet if I request more than the first section, the HTML takes longer to load and parse, which is unnecessary for most articles. See this query for an example of a disambiguation page whose links are not included in the first section.
As of right now, I've gotten as far as detecting when a disambiguation page is reached, using code like

bs4.BeautifulSoup(page_html, "html.parser").find("p", recursive=False).get_text().endswith(("refer to:", "refers to:"))
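Since parsing speed is the concern, the same check can be done without bs4 at all. Here is a stdlib-only sketch using html.parser that reads just the first &lt;p&gt; of the HTML and tests for the "refer to:" suffixes from the snippet above:

```python
from html.parser import HTMLParser

class FirstParagraphText(HTMLParser):
    """Collect the text of the first <p> element, then stop collecting."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # first <p> finished; ignore the rest

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)

def is_disambiguation(page_html):
    parser = FirstParagraphText()
    parser.feed(page_html)
    text = "".join(parser.parts).strip()
    return text.endswith(("refer to:", "refers to:"))

print(is_disambiguation("<p>Mercury may refer to:</p>"))  # True
```

This avoids building a full parse tree, which is most of bs4's cost on large documents.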
I also spent a while trying to write code that automatically followed a link, before I realized that the links were not included in the first section.
I'd prefer to keep the number of requests made to a minimum. I also need to be parsing as little HTML as possible, because speed is essential for my application.
I could envision several solutions to this problem:

- Using a faster HTML parser than bs4, so that it doesn't matter if I end up having to parse the entire page HTML

As Tgr and everybody said, no, such a feature doesn't exist, because it wouldn't make sense: the first link on a disambiguation page doesn't have any special status or meaning.
As for the existing API, see https://www.mediawiki.org/wiki/Extension:Disambiguator#API_usage
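Assuming the pageprops-based check that the Extension:Disambiguator page documents (a "disambiguation" page property on disambiguation pages), detection can be sketched like this; the sample response below is a trimmed, hypothetical example of that shape:

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def disambig_check_url(title):
    # prop=pageprops exposes the "disambiguation" property that
    # Extension:Disambiguator sets on disambiguation pages.
    return API + "?" + urlencode({
        "action": "query",
        "prop": "pageprops",
        "ppprop": "disambiguation",
        "titles": title,
        "format": "json",
    })

def is_disambig_page(response_json):
    pages = json.loads(response_json)["query"]["pages"]
    return any("disambiguation" in p.get("pageprops", {})
               for p in pages.values())

# Trimmed, hypothetical response of the documented shape:
sample = '{"query": {"pages": {"1": {"pageprops": {"disambiguation": ""}}}}}'
print(is_disambig_page(sample))  # True
```

This costs one extra query request but no extra HTML parsing, which may suit the speed constraint better than string-matching the lead paragraph.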
By the way, the "bot policy" you linked does not really apply to crawlers/scrapers; the only relevant policy/guideline is the User-Agent policy.