Tags: python-3.x, web-scraping, beautifulsoup, wikipedia

Scraping Wikipedia tables with Python selectively


I'm having trouble extracting data from a Wikipedia table and hope someone who has done this before can give me advice. From List_of_current_heads_of_state_and_government I need the countries (this works with the code below) and then only the first head of state listed for each country: their title and their name. I'm not sure how to isolate the first mention, since all of them sit in one cell. My attempt to pull the names also fails with this error: IndexError: list index out of range. I'd appreciate your help!

import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
website_url = requests.get(wiki).text
soup = BeautifulSoup(website_url,'lxml')

my_table = soup.find('table',{'class':'wikitable plainrowheaders'})
#print(my_table)

states = []
titles = []
names = []
for row in my_table.find_all('tr')[1:]:
    state_cell = row.find_all('a')[0]  
    states.append(state_cell.text)
print(states)
for row in my_table.find_all('td'):
    title_cell = row.find_all('a')[0]
    titles.append(title_cell.text)
print(titles)
for row in my_table.find_all('td'):
    name_cell = row.find_all('a')[1]
    names.append(name_cell.text)
print(names)

The desired output is a pandas DataFrame with these columns:

State | Title | Name |

Solution

  • If I understand your question correctly, the following should get you there:

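    import requests
    from bs4 import BeautifulSoup
    
    URL = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
    
    res = requests.get(URL).text
    soup = BeautifulSoup(res, 'lxml')
    # Skip the header row, then read the first two cells of each remaining row.
    for items in soup.find('table', class_='wikitable').find_all('tr')[1:]:
        data = items.find_all(['th', 'td'])
        try:
            country = data[0].a.text  # first link in the row header: the state
            title = data[1].a.text    # first link in the head-of-state cell: the title
            name = data[1].a.find_next_sibling().text  # next tag after the title: the name
        except (IndexError, AttributeError):
            # Skip rows that lack the expected cells or links, rather than
            # printing stale values left over from the previous row.
            continue
        print("{}|{}|{}".format(country, title, name))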
    

    Output:

    Afghanistan|President|Ashraf Ghani
    Albania|President|Ilir Meta
    Algeria|President|Abdelaziz Bouteflika
    Andorra|Episcopal Co-Prince|Joan Enric Vives Sicília
    Angola|President|João Lourenço
    Antigua and Barbuda|Queen|Elizabeth II
    Argentina|President|Mauricio Macri
    

    And so on.
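
    The question asks for a pandas DataFrame rather than printed lines, so here is a minimal sketch of that last step. It is not the code above: instead of find_next_sibling(), it assumes the first two links in the head-of-state cell are the title and the name, which is also how it isolates the first mention when a cell lists several. SAMPLE_HTML, the extract_rows helper, the "Second Title"/"Second Name" entries and the "Exampleland" row are hypothetical stand-ins so the sketch runs offline; it also uses the built-in html.parser instead of lxml. With the live page you would pass requests.get(URL).text instead, and the real markup may need adjustments.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table (not the real markup).
SAMPLE_HTML = """
<table class="wikitable plainrowheaders">
  <tr><th>State</th><th>Head of state</th></tr>
  <tr>
    <th><a href="#">Afghanistan</a></th>
    <td><a href="#">President</a> - <a href="#">Ashraf Ghani</a></td>
  </tr>
  <tr>
    <th><a href="#">Andorra</a></th>
    <td><a href="#">Episcopal Co-Prince</a> - <a href="#">Joan Enric Vives Sicília</a><br>
        <a href="#">Second Title</a> - <a href="#">Second Name</a></td>
  </tr>
  <tr>
    <th><a href="#">Exampleland</a></th>
    <td>vacant</td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return (state, title, name) tuples, one per usable table row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr")[1:]:  # [1:] skips the header row
        cells = tr.find_all(["th", "td"])
        if len(cells) < 2:
            continue  # no head-of-state cell in this row
        state_link = cells[0].find("a")
        links = cells[1].find_all("a")
        if state_link is None or len(links) < 2:
            continue  # row lacks the links this sketch relies on
        # Assumption: links[0] is the title and links[1] is the name of the
        # first head of state; any later links in the cell are ignored.
        rows.append((
            state_link.get_text(strip=True),
            links[0].get_text(strip=True),
            links[1].get_text(strip=True),
        ))
    return rows


df = pd.DataFrame(extract_rows(SAMPLE_HTML), columns=["State", "Title", "Name"])
print(df)
```

    Checking the number of links before indexing is what avoids the IndexError from the question: cells with fewer than two links are skipped instead of being indexed with [1].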