python · web-scraping · python-requests · wikipedia

Collect data from side table(s) in wikipedia page(s)


I'm trying to create a Python script that collects information from the side tables on a Wikipedia page. For an example, see this page. Along the right-hand side of the page there are three vertical HTML tables: the first is titled "Ford Fusion", the second "First generation", and the third "Second generation".

When I try to collect the HTML for the page, the tables on the right are not returned by code like this:

import requests
from bs4 import BeautifulSoup

search_string = "Ford Fusion"
search_url = f"https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srsearch={search_string}"
search_response = requests.get(search_url)
search_data = search_response.json()

# Take the title of the top search result, then request the page
# extract for that title from the API
closest_match = search_data["query"]["search"][0]["title"]
page_url = f"https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&titles={closest_match}"
page_response = requests.get(page_url)
page_data = page_response.json()

page_id = list(page_data["query"]["pages"].keys())[0]

html_text = page_data["query"]["pages"][page_id]["extract"]
soup = BeautifulSoup(html_text, "html.parser")

tables = soup.find_all('table')
print(len(tables))

>> 0

I've inspected the html_text variable and the tables aren't there at all, even though I can plainly see them when inspecting the page in my browser. How can I get these tables returned as part of the requests.get call?


Solution

  • The problem is that the `prop=extracts` API endpoint returns only a simplified extract of the article text, with infoboxes and other tables stripped out. If you instead fetch the rendered article page directly, the tables are included in the HTML response:

    import requests
    from bs4 import BeautifulSoup
    
    search_string = "Ford Fusion"
    search_url = f"https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srsearch={search_string}"
    search_response = requests.get(search_url)
    search_data = search_response.json()
    
    # Take the title of the top search result and fetch the rendered
    # article page (Wikipedia page URLs use underscores for spaces)
    closest_match = search_data["query"]["search"][0]["title"]
    page_url = f"https://en.wikipedia.org/wiki/{closest_match.replace(' ', '_')}"
    
    page_response = requests.get(page_url)
    
    html_text = page_response.text
    soup = BeautifulSoup(html_text, "html.parser")
    
    tables = soup.find_all('table')
    print(len(tables))
    
    >> 13
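
  • Note that `find_all('table')` returns every table on the page, not just the side tables. On Wikipedia, the side tables are infoboxes and carry the CSS class `infobox`, so you can filter on that class and collect the label/value rows into a dict. A minimal, self-contained sketch (the sample HTML below is a simplified stand-in for the real page markup, which is more deeply nested):

    ```python
    from bs4 import BeautifulSoup
    
    # Simplified sample mimicking the structure of a Wikipedia infobox:
    # a <table class="infobox"> whose rows pair a <th> label with a <td> value.
    sample_html = """
    <table class="infobox">
      <caption>Ford Fusion</caption>
      <tr><th>Manufacturer</th><td>Ford</td></tr>
      <tr><th>Production</th><td>2005-2020</td></tr>
    </table>
    """
    
    def infobox_to_dict(table):
        """Collect the label/value rows of an infobox table into a dict."""
        data = {}
        for row in table.find_all("tr"):
            header = row.find("th")
            value = row.find("td")
            if header and value:  # skip rows without a label/value pair
                data[header.get_text(strip=True)] = value.get_text(strip=True)
        return data
    
    soup = BeautifulSoup(sample_html, "html.parser")
    infobox = soup.find("table", class_="infobox")
    print(infobox_to_dict(infobox))
    ```

    On the real page you would run the same `soup.find_all("table", class_="infobox")` against the HTML fetched above; expect some cleanup work, since real infobox rows can contain nested tags, line breaks, and citation markers.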