Tags: python, web-scraping, beautifulsoup, html-table

Web scraping a table with rowspans larger than 1


I want to scrape the first Wikipedia table from https://en.wikipedia.org/wiki/List_of_Eurovision_Song_Contest_host_cities. The difficulty lies in the fact that the table has merged cells (some of the entries have rowspans larger than 1).

For example, the first entry in the Contests column is 9 and applies to the first 9 rows of the table (it has a rowspan of 9). So when scraping the data into a pandas DataFrame, I want the first 9 rows of the Contests column to contain the entry '9'.

I have tried the following:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_Eurovision_Song_Contest_host_cities'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Create an empty DataFrame with desired column headers
df = pd.DataFrame(columns=['Contests', 'Country', 'City', 'Venue', 'Year', 'Ref'])

for index, row in enumerate(soup.find_all('tr')):
    if index == 0:  # Skip the first header row
        continue

    cells = row.find_all(['td', 'th'])
    
    country_value = None
    if cells[0].has_attr('rowspan'):
        contests_value = cells[0].get_text(strip=True)
        contests_rowspan = int(cells[0]['rowspan'])
        contests_values = [contests_value] * contests_rowspan # Replicate the value the required number of times
        df = df.append(pd.DataFrame({'Contests': contests_values}), ignore_index=True)

    if cells[1].has_attr('rowspan'):
        country_value = cells[1].get_text(strip=True)
        country_rowspan = int(cells[1]['rowspan'])
        country_values = [country_value] * country_rowspan
        df = df.append(pd.DataFrame({'Country': country_values}), ignore_index=True)

    if cells[2].has_attr('rowspan'):
        print(cells[2])
        city_value = cells[2].get_text(strip=True)
        city_rowspan = int(cells[2]['rowspan'])
        city_values = [city_value] * city_rowspan
        df = df.append(pd.DataFrame({'City': city_values}), ignore_index=True)
    
    venue_value = cells[3].get_text(strip=True)
    year_value = cells[4].get_text(strip=True)
    ref_value = cells[5].get_text(strip=True)
    
    for _ in range(max(contests_rowspan, country_rowspan, city_rowspan)):
            df = df.append({'Venue': venue_value, 'Year': year_value, 'Ref': ref_value}, ignore_index=True)

df.head()

The problem with this is that in the first row cells[0] corresponds to Contests, cells[1] to Country and cells[2] to City. However, since these three cells all have a rowspan larger than 1, they are not repeated in the HTML of the second row, so in the second row cells[0] corresponds to Venue, cells[1] to Year and cells[2] to Ref. Note that the rowspans for Contests, Country and City are not always the same.
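
Printing the number of cells per row (continuing from the soup object above) makes the shift visible; the exact output depends on the live page, so treat this as a diagnostic sketch:

# Rows covered by an open rowspan contain fewer <td>/<th> cells, so a fixed
# index like cells[0] points at a different column in those rows.
for index, row in enumerate(soup.find_all('tr')[:4]):
    cells = row.find_all(['td', 'th'])
    print(index, len(cells), [c.get_text(strip=True) for c in cells])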

I am not sure how to fix this.


Solution

  • In this case, it seems you can let pd.read_html do the heavy lifting for you:

    Option 1: pd.read_html

    import pandas as pd
    
    url = 'https://en.wikipedia.org/wiki/List_of_Eurovision_Song_Contest_host_cities'
    df = pd.read_html(url)[0]  # select the first table on the page
    
    df.head(2)
    
       Contests         Country    City                  Venue  Year Ref.
    0         9  United Kingdom  London    Royal Festival Hall  1960  [1]
    1         9  United Kingdom  London  BBC Television Centre  1963  [2]
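
    If selecting the table by position feels brittle, pd.read_html also accepts a match argument (a string or regex the table's text must contain); a small sketch, assuming the 'Contests' header text stays stable:

    # Select the table by text it contains rather than by index.
    df = pd.read_html(url, match='Contests')[0]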
    

    Option 2: for loop

    With the for loop, this could be one approach:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import numpy as np
    
    url = 'https://en.wikipedia.org/wiki/List_of_Eurovision_Song_Contest_host_cities'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    
    table = soup.find('table')
    
    cols = ['Contests', 'Country', 'City', 'Venue', 'Year', 'Ref.']
    
    rows = []
    
    for index, row in enumerate(table.find_all('tr')):
        if index == 0:
            continue
                             
        values = [cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])]
        if len(values) != 6:
            values[:0] = [np.nan]*(6-len(values))
        
        rows.append(values)
        
    df = pd.DataFrame(rows, columns=cols).ffill()
    

    Output

    df.head(2)
    
      Contests         Country    City                  Venue  Year Ref.
    0        9  United Kingdom  London    Royal Festival Hall  1960  [1]
    1        9  United Kingdom  London  BBC Television Centre  1963  [2]
    
    # N.B. `pd.read_html` returns `Contests` with dtype `int64`, here: `object`.
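
    # If matching read_html's numeric dtype matters, a hedged option is to convert
    # the column after building the frame (assumes the cells contain only digits):
    df['Contests'] = pd.to_numeric(df['Contests'])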
    

    Explanation

    • Create a list rows to collect all rows, to be used for pd.DataFrame after the loop.

      (Initializing an empty df and then adding rows to it one by one is very expensive; avoid df.append in general: it was deprecated in pandas 1.4.0 and removed in 2.0.)

    • Inside the loop, use a list comprehension to get_text for each element in row.find_all(['td', 'th']), and store in variable values.

    • For a complete row, len(values) == 6. If len(values) < 6, the missing (6-len(values)) values are at the start of the list, because the rowspanned cells (Contests, Country, City) come first in each row. Hence we prepend that many NaN values, which we can forward fill later on. For the values[:0] slice assignment, cf. this post; see also the sketch after this list.

    • Add values to rows with list.append.

    • After the loop, create your df, and chain df.ffill to fill each NaN with the last valid value from the rows above.
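
    To see the prepend and the forward fill on toy data, here is a minimal sketch built from the two rows shown in the output above:

    import numpy as np
    import pandas as pd
    
    # A complete row (6 cells) followed by a short row (3 cells), as in the table.
    full_row = ['9', 'United Kingdom', 'London', 'Royal Festival Hall', '1960', '[1]']
    values = ['BBC Television Centre', '1963', '[2]']
    
    # Prepend NaNs so the short row lines up with all 6 columns.
    values[:0] = [np.nan] * (6 - len(values))
    
    cols = ['Contests', 'Country', 'City', 'Venue', 'Year', 'Ref.']
    df = pd.DataFrame([full_row, values], columns=cols)
    print(df.ffill())  # the NaNs in the second row are filled from the first row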