Tags: html, web-scraping, beautifulsoup, python-requests

Scrape data from website with complex structure


I am trying to scrape data from the Transfermarkt website in Python, but the website structure is complex. I've tried using the requests and Beautiful Soup modules with the code below, yet the end result is two empty dataframes for 'in' and 'out' transfers. I want to extract the information from the tables (displayed in the picture) into two separate dataframes: in_transfers_df should contain the information displayed in the 'In' tables, and out_transfers_df should contain the information displayed in the 'Out' tables. This should be repeated for each club header, e.g. Arsenal, Aston Villa.

I've attached a photo showing the structure of the website and my code attempt. Any help would be greatly appreciated.

[Screenshot: the 'In' and 'Out' transfer tables listed under each club header]

import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the Transfermarkt page
url = 'https://www.transfermarkt.com/premier-league/transfers/wettbewerb/GB1/plus/?saison_id=2023&s_w=&leihe=0&intern=0'

# Send a GET request to the URL
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get(url, headers=headers)
response.raise_for_status()  # Raise an exception if the request was unsuccessful

# Parse the page content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Function to extract transfer data
def extract_transfer_data(table):
    transfers = []
    rows = table.find_all('tr', class_=['odd', 'even'])
    for row in rows:
        cols = row.find_all('td')
        if len(cols) >= 5:  # Ensure there are enough columns
            transfers.append({
                'Player': cols[0].text.strip(),
                'Age': cols[1].text.strip(),
                'Club': cols[2].text.strip(),
                'Fee': cols[4].text.strip()
            })
    return transfers

# Locate the main transfer table container
transfer_containers = soup.find_all('div', class_='grid-view')

# Debugging: print the number of transfer containers found
print(f"Found {len(transfer_containers)} transfer containers.")

# Extract 'In' and 'Out' transfers data
in_transfers = []
out_transfers = []

for container in transfer_containers:
    headers = container.find_all('h2')
    tables = container.find_all('table')
    for header, table in zip(headers, tables):
        if 'In' in header.text:
            in_transfers.extend(extract_transfer_data(table))
        elif 'Out' in header.text:
            out_transfers.extend(extract_transfer_data(table))

# Convert to DataFrames
in_transfers_df = pd.DataFrame(in_transfers)
out_transfers_df = pd.DataFrame(out_transfers)

Solution

  • As @GTK correctly pointed out, the guide you're following is outdated: the data you need now sit in div elements with class box. Those are the elements you need to hook into in order to retrieve the data.

    However, not every box element contains transfer data; other blocks on the page share a similar structure, so you have to filter the boxes carefully. A quick check is shown below.
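
    Using the soup object from your own snippet, here is a minimal diagnostic sketch (assuming, as the full solution below does, that club boxes have a header link carrying the club name in its title attribute, which the unrelated boxes lack):

    # List every "box" div's headline link: club boxes print a club name,
    # unrelated boxes print None, so they are easy to filter out.
    for box in soup.find_all('div', class_='box'):
        link = box.find('a')
        print(link.get('title') if link else None)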

    So here's a solution I've sketched out on the fly that should work for you. Take it step by step, though, and make sure you understand how it works; improve the error handling and, if needed, load the data into pandas (a sketch of that follows the code).

    from collections import defaultdict
    from pprint import pprint
    
    import requests
    from bs4 import BeautifulSoup
    
    start_url = (
        'https://www.transfermarkt.com/premier-league/transfers/wettbewerb/GB1/'
        'plus/?saison_id=2023&s_w=&leihe=0&intern=0'
    )
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    response = requests.get(start_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    
    def extract_club_name(node):
        # Club boxes have a header link whose "title" is the club name;
        # other "box" divs on the page don't, so return None for those.
        try:
            return node.find('a')['title']
        except (TypeError, KeyError):
            return None
    
    
    def parse_transfers_table(node):
        # Yield one dict per row of the table's body.
        for tr in node.find('tbody').find_all('tr'):
            national = tr.find('td', class_='nat-transfer-cell')
            prev_club_data = tr.find(
                'td',
                class_='no-border-links verein-flagge-transfer-cell',
            )
            previous_club = (
                '' if prev_club_data.find('a') is None
                else prev_club_data.find('a')['title']
            )
    
            yield {
                'name': tr.find('span').find('a')['title'],
                'age': tr.find('td', class_='alter-transfer-cell').text,
                # Flag icons carry the nationality in their "title" attribute.
                'national': [
                    img['title']
                    for img in national.find_all('img')
                    if img.has_attr('title')
                ],
                'position': tr.find('td', class_='kurzpos-transfer-cell').text,
                'market_price': tr.find('td', class_='mw-transfer-cell').text,
                'previous_club': previous_club,
                'transfer_value': tr.find('td', class_='rechts').text,
            }
    
    
    result = defaultdict(dict)  # club name -> {'in': [...], 'out': [...]}
    for club_info in soup.find_all('div', class_='box'):
        club_name = extract_club_name(club_info)
        if club_name is None:
            continue
    
        # Each club box holds exactly two tables: arrivals first, then departures.
        in_transfers_table, out_transfers_table = (
            club_info.find_all('div', class_='responsive-table')
        )
        result[club_name]['in'] = [*parse_transfers_table(in_transfers_table)]
        result[club_name]['out'] = [*parse_transfers_table(out_transfers_table)]
    
    pprint(result)
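
    If you then want the data in pandas, as in your original attempt, here is a minimal sketch that flattens the nested result dict into the two dataframes you were after (the column names simply follow the keys parsed above):

    import pandas as pd

    # Flatten {club: {'in': [...], 'out': [...]}} into flat row lists,
    # tagging each transfer with the club it belongs to.
    in_rows = [
        {'club': club, **transfer}
        for club, moves in result.items()
        for transfer in moves['in']
    ]
    out_rows = [
        {'club': club, **transfer}
        for club, moves in result.items()
        for transfer in moves['out']
    ]

    in_transfers_df = pd.DataFrame(in_rows)
    out_transfers_df = pd.DataFrame(out_rows)
    print(in_transfers_df.head())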