python pandas wikipedia

How do I scrape a particular table from Wikipedia, using Python?


I'm having difficulty scraping specific tables from Wikipedia. Here is my code.

import pandas as pd
import requests
from bs4 import BeautifulSoup

wikiurl = 'https://en.wikipedia.org/wiki/List_of_towns_in_India_by_population'
table_class = "wikitable sortable jquery-tablesorter"
response = requests.get(wikiurl)
print(response.status_code)

soup = BeautifulSoup(response.text, 'html.parser')
cities = soup.find('table', {"class":"wikitable sortable jquery-tablesorter"})

df = pd.read_html(str(cities))
df=pd.DataFrame(df[0])
print(df.to_string())

The class is copied from the table tag as it appears when I inspect the page (I'm using Edge as my browser). Changing the index (df[0]) raises an "index out of range" error.

Is there a unique identifier in the Wikipedia source code for each table? I would like a solution, but I'd also really like to know where I'm going wrong, as I feel I'm close to understanding this.


Solution

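  • Your main difficulty is in matching the table's class. "wikitable sortable jquery-tablesorter" is three separate classes, and the third one, jquery-tablesorter, is added by JavaScript in the browser after the page loads. It is not in the HTML that requests downloads, where the tag is just <table class="wikitable sortable">, so soup.find matches nothing. Note also that a Python dictionary cannot hold the "class" key more than once (the last value wins), so the code below uses a CSS selector to require both remaining classes. As for the index error, pd.read_html returns a list with one DataFrame per table in the HTML it is given, so once a single table is passed in, 0 is the only valid index.

    Hopefully this should help:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    wikiurl = 'https://en.wikipedia.org/wiki/List_of_towns_in_India_by_population'
    response = requests.get(wikiurl)
    print(response.status_code)
    
    # 200
    
    soup = BeautifulSoup(response.text, 'html.parser')
    # Select every table that carries both classes present in the downloaded HTML
    cities = soup.select('table.wikitable.sortable')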
    print(cities[0])
    
    # <table class="wikitable sortable">
    # <tbody><tr>
    # <th>Name of Town
    # </th>
    # <th>State
    # ....
    
    tables = pd.read_html(str(cities[0]))
    print(tables[0])
    
    #      Name of Town           State  ... Population (2011)  Ref
    # 0        Achhnera   Uttar Pradesh  ...             22781  NaN
    # 1          Adalaj         Gujarat  ...             11957  NaN
    # 2           Adoor          Kerala  ...             29171  NaN
    # ....
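
To see why the class string copied from the browser fails, here is a minimal offline sketch. It assumes a small hand-written sample (SAMPLE_HTML is hypothetical and only stands in for the downloaded page source), and it is meant as an illustration of the matching rules rather than a complete scraper:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML that requests downloads: the table carries
# only the classes written in the page source, not the one the browser adds later.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Name of Town</th><th>State</th></tr>
  <tr><td>Achhnera</td><td>Uttar Pradesh</td></tr>
</table>
<table class="wikitable">
  <tr><th>Other</th></tr>
  <tr><td>x</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A dict literal cannot hold the same key twice; the last value silently wins.
duplicate_keys = {"class": "wikitable", "class": "sortable"}
print(duplicate_keys)

# The class string copied from the browser inspector matches nothing here,
# because "jquery-tablesorter" is absent from the downloaded markup.
browser_match = soup.find("table", {"class": "wikitable sortable jquery-tablesorter"})
print(browser_match)

# A CSS selector requires every listed class, regardless of their order.
both_classes = soup.select("table.wikitable.sortable")
print(len(both_classes))

# Listing each table's classes shows what is actually available to match on.
class_lists = [table.get("class") for table in soup.find_all("table")]
print(class_lists)
```

Running the last two lines against the real response.text is a quick way to check which classes a page really serves before writing a selector.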