python web-scraping beautifulsoup html-parsing

How to scrape a website table where the cell values have the same class name?


I am trying to scrape a (football squad) table from Transfermarkt.com for a project but some columns have the same class name and cannot be differentiated.

Columns [2,10] have unique classes and work fine, but I am struggling to find a way to get the rest.

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://www.transfermarkt.com/hertha-bsc-u17/kader/verein/21066/saison_id/2018/plus/1"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
Values = pageSoup.find_all("td", {"class": "zentriert"})

PlayersList = []
ValuesList = []

for i in range(0, 25):
    PlayersList.append(Players[i].text)
    ValuesList.append(Values[i].text)

df = pd.DataFrame({"Players": PlayersList, "Values": ValuesList})

I would like to scrape every column for each row of that table.


Solution

  • This uses bs4, pandas and CSS selectors. It separates the position (e.g. goalkeeper) from the name. It doesn't include market value, as no values are given. For each player, it joins all nationalities with "/", and does the same for the "signed from" values. Columns that share a class can be differentiated with nth-of-type.

    from bs4 import BeautifulSoup as bs
    import requests
    import pandas as pd
    
    headers = {'User-Agent' : 'Mozilla/5.0'}
    df_headers = ['position_number' , 'position_description' , 'name' , 'dob' , 'nationality' , 'height' , 'foot' , 'joined' , 'signed_from' , 'contract_until']
    r = requests.get('https://www.transfermarkt.com/hertha-bsc-u17/kader/verein/21066/saison_id/2018/plus/1', headers = headers)
    soup = bs(r.content, 'lxml')
    
    position_number = [item.text for item in soup.select('.items .rn_nummer')]
    position_description = [item.text for item in soup.select('.items td:not([class])')]
    name = [item.text for item in soup.select('.hide-for-small .spielprofil_tooltip')]
    dob = [item.text for item in soup.select('.zentriert:nth-of-type(3):not([id])')]
    nationality = ['/'.join([i['title'] for i in item.select('[title]')]) for item in soup.select('.zentriert:nth-of-type(4):not([id])')]
    height = [item.text for item in soup.select('.zentriert:nth-of-type(5):not([id])')]
    foot = [item.text for item in soup.select('.zentriert:nth-of-type(6):not([id])')]
    joined = [item.text for item in soup.select('.zentriert:nth-of-type(7):not([id])')]
    signed_from = ['/'.join([item['title'].lstrip(': '), item['alt']])  for item in soup.select('.zentriert:nth-of-type(8):not([id]) [title]')]
    contract_until = [item.text for item in soup.select('.zentriert:nth-of-type(9):not([id])')]
    
    df = pd.DataFrame(list(zip(position_number, position_description, name, dob, nationality, height, foot, joined, signed_from, contract_until)), columns = df_headers)
    print(df.head())
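
To see in isolation why nth-of-type separates cells that share a class, here is a minimal sketch on a small inline table. The HTML is invented for illustration (it only mimics the shared "zentriert" class and is not the real Transfermarkt markup), and it uses 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table: the markup and player names are made up,
# only the shared "zentriert" class is borrowed from the question.
html = """
<table class="items">
  <tr>
    <td class="zentriert">1</td>
    <td>Player A</td>
    <td class="zentriert">Jan 1, 2002</td>
    <td class="zentriert"><img title="Germany"><img title="Poland"></td>
  </tr>
  <tr>
    <td class="zentriert">12</td>
    <td>Player B</td>
    <td class="zentriert">Feb 2, 2002</td>
    <td class="zentriert"><img title="Turkey"></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# nth-of-type(n) matches the n-th <td> among its sibling <td> elements,
# so the position in the row picks the column even though the class repeats.
numbers = [td.get_text(strip=True) for td in soup.select("td.zentriert:nth-of-type(1)")]
dobs = [td.get_text(strip=True) for td in soup.select("td.zentriert:nth-of-type(3)")]

# A cell can hold several flags; join every title attribute with "/".
nationalities = [
    "/".join(img["title"] for img in td.select("[title]"))
    for td in soup.select("td.zentriert:nth-of-type(4)")
]

print(numbers)
print(dobs)
print(nationalities)
```

Each selector returns one entry per row, so the lists line up when zipped into a DataFrame. On a real page that only holds if every row has the cell in question, since zip silently truncates to the shortest list.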
    
