python, web-scraping, beautifulsoup, xml-parsing, html-parsing

Python web scraping not showing all rows with BeautifulSoup


I'm trying to scrape the squad overviews of several Transfermarkt pages and have realised that rows are missing for some of them.

Here are two example webpages:

Works: All rows included here.

Doesn't work: Rows missing here.

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

headers = {'User-Agent' : 'Mozilla/5.0'}
df_headers = ['position_number' , 'position_description' , 'name' , 'dob' , 'nationality' , 'height' , 'foot' , 'joined' , 'signed_from' , 'contract_until']
r = requests.get('https://www.transfermarkt.com/grasshopper-club-zurich-u17/kader/verein/59526/saison_id/2018/plus/1', headers = headers)
soup = bs(r.content, 'html.parser')

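# Each selector below collects one column for all players; the lists are zipped into a DataFrame further down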
position_number = [item.text for item in soup.select('.items .rn_nummer')]
position_description = [item.text for item in soup.select('.items td:not([class])')]
name = [item.text for item in soup.select('.hide-for-small .spielprofil_tooltip')]
dob = [item.text for item in soup.select('.zentriert:nth-of-type(3):not([id])')]
nationality = ['/'.join([i['title'] for i in item.select('[title]')]) for item in soup.select('.zentriert:nth-of-type(4):not([id])')]
height = [item.text for item in soup.select('.zentriert:nth-of-type(5):not([id])')]
foot = [item.text for item in soup.select('.zentriert:nth-of-type(6):not([id])')]
joined = [item.text for item in soup.select('.zentriert:nth-of-type(7):not([id])')]
signed_from = ['/'.join([item['title'].lstrip(': '), item['alt']])  for item in soup.select('.zentriert:nth-of-type(8):not([id]) [title]')]
contract_until = [item.text for item in soup.select('.zentriert:nth-of-type(9):not([id])')]

df = pd.DataFrame(list(zip(position_number, position_description, name, dob, nationality, height, foot, joined, signed_from, contract_until)), columns = df_headers)
print(df)

df.to_csv('grasshopper18.csv')

This is what I get for a page that should contain 22 rows:

  position_number  ... contract_until
0               -  ...              -
1               -  ...              -
2               -  ...              -
3               -  ...              -
4               -  ...              -
5               -  ...              -
6               -  ...              -
7               -  ...              -
8               -  ...     30.06.2019

[9 rows x 10 columns]

Process finished with exit code 0

I can't work out why it works for some pages but not for others. Any help would be much appreciated.


Solution

  • The issue is in this line:

    signed_from = ['/'.join([item['title'].lstrip(': '), item['alt']]) for item in soup.select('.zentriert:nth-of-type(8):not([id]) [title]')]

    The trailing [title] in the selector only matches cells that actually contain an element with a title attribute, so players whose "signed from" cell is empty are skipped entirely. signed_from therefore ends up shorter than the other lists, and zip() silently truncates everything to the shortest list, which is why the DataFrame is cut down to 9 rows instead of 22. You can confirm this by printing len() of each list before building the DataFrame.

    You can modify it this way, so that every cell yields an entry and empty cells fall back to an empty string:

    signed_from = ['/'.join([item.find('img')['title'].lstrip(': '), item.find('img')['alt']]) if item.find('a') else '' for item in soup.select('.zentriert:nth-of-type(8):not([id])')]
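  • More generally, a pattern that avoids this class of bug is to build one record per table row (tr) instead of collecting whole columns, so an empty cell can only affect its own row and never shortens a column before the zip(). Below is a minimal sketch of that idea for the same page; the row selector, cell indices and fallbacks are assumptions about the page layout and may need adjusting:

    from bs4 import BeautifulSoup as bs
    import requests
    import pandas as pd

    headers = {'User-Agent': 'Mozilla/5.0'}
    url = ('https://www.transfermarkt.com/grasshopper-club-zurich-u17/kader/'
           'verein/59526/saison_id/2018/plus/1')
    soup = bs(requests.get(url, headers=headers).content, 'html.parser')

    records = []
    # One record per data row; assumes the squad table carries the "items"
    # class that the question's selectors already rely on.
    for tr in soup.select('table.items > tbody > tr'):
        cells = tr.find_all('td', recursive=False)
        if not cells:
            continue
        img = cells[7].find('img') if len(cells) > 7 else None
        records.append({
            # Cell positions are assumed; adjust the indices to the real layout.
            'name': cells[1].get_text(strip=True) if len(cells) > 1 else '',
            # Per-row fallback: an empty string when the "signed from" cell is
            # empty, so one missing value never shifts the other columns.
            'signed_from': img['title'].lstrip(': ') if img and img.has_attr('title') else '',
        })

    df = pd.DataFrame(records)
    print(df)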