python python-3.x selenium beautifulsoup google-colaboratory

Grab table from football recruiting website


I would like to create the exact same table as the one shown in the following webpage: https://247sports.com/college/penn-state/Season/2022-Football/Commits/

I am using Selenium and Beautiful Soup in a Google Colab notebook because pandas' "read_html" fails with Forbidden (HTTP 403) errors on this URL. I have started to get some output, but I only want the text of each element, not the HTML tags surrounding it.

Here is my code so far...

from kora.selenium import wd
from bs4 import BeautifulSoup
import pandas as pd
import time
import datetime as dt
import os
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

url = 'https://247sports.com/college/penn-state/Season/2022-Football/Commits/'
wd.get(url)
time.sleep(5)

soup = BeautifulSoup(wd.page_source, "html.parser")  # name the parser explicitly to avoid the bs4 warning

school=soup.find_all('span', class_='meta')    
name=soup.find_all('div', class_='recruit')
position = soup.find_all('div', class_="position")
height_weight = soup.find_all('div', class_="metrics")
rating = soup.find_all('span', class_='score')
nat_rank = soup.find_all('a', class_='natrank')
state_rank = soup.find_all('a', class_='sttrank')
pos_rank = soup.find_all('a', class_='posrank')
status = soup.find_all('p', class_='commit-date withDate')

status

...and here is my output...

[<p class="commit-date withDate"> Commit 7/25/2020  </p>,
 <p class="commit-date withDate"> Commit 9/4/2020  </p>,
 <p class="commit-date withDate"> Commit 1/1/2021  </p>,
 <p class="commit-date withDate"> Commit 3/8/2021  </p>,
 <p class="commit-date withDate"> Commit 10/29/2020  </p>,
 <p class="commit-date withDate"> Commit 7/28/2020  </p>,
 <p class="commit-date withDate"> Commit 9/8/2020  </p>,
 <p class="commit-date withDate"> Commit 8/3/2020  </p>,
 <p class="commit-date withDate"> Commit 5/1/2021  </p>]

Any assistance on this is greatly appreciated.


Solution

  • There's no need to use Selenium. To get a response from the website, you need to set the HTTP User-Agent header; otherwise, the website assumes you're a bot and blocks the request.

    To create a DataFrame, see this sample:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    
    url = "https://247sports.com/college/penn-state/Season/2022-Football/Commits/"
    # Add the `user-agent` otherwise we will get blocked when sending the request
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
    }
    
    
    response = requests.get(url, headers=headers).content
    soup = BeautifulSoup(response, "html.parser")
    data = []
    
    for tag in soup.find_all("li", class_="ri-page__list-item")[1:]:  # `[1:]` Since the first result is a table header
        school = tag.find_next("span", class_="meta").text
        name = tag.find_next("a", class_="ri-page__name-link").text
        position = tag.find_next("div", class_="position").text
        height_weight = tag.find_next("div", class_="metrics").text
        rating = tag.find_next("span", class_="score").text
        nat_rank = tag.find_next("a", class_="natrank").text
        state_rank = tag.find_next("a", class_="sttrank").text
        pos_rank = tag.find_next("a", class_="posrank").text
        status = tag.find_next("p", class_="commit-date withDate").text
    
        data.append(
            {
                "school": school,
                "name": name,
                "position": position,
                "height_weight": height_weight,
                "rating": rating,
                "nat_rank": nat_rank,
                "state_rank": state_rank,
                "pos_rank": pos_rank,
                "status": status,
            }
        )
    
    df = pd.DataFrame(data)
    
    print(df.to_string())
    

    Output:

                                                        school            name position height_weight  rating nat_rank state_rank pos_rank                status
    0                  Westerville South (Westerville, OH)      Kaden Saunders      WR    5-10 / 172   0.9509      116          5       16    Commit 7/25/2020  
    1                          IMG Academy (Bradenton, FL)        Drew Shelton      OT     6-5 / 290   0.9468      130         17       14     Commit 9/4/2020  
    2                Central Dauphin East (Harrisburg, PA)       Mehki Flowers      WR     6-1 / 190   0.9461      131          4       18     Commit 1/1/2021  
    3                                  Medina (Medina, OH)          Drew Allar     PRO     6-5 / 220   0.9435      138          6        8     Commit 3/8/2021  
    4                     Manheim Township (Lancaster, PA)        Anthony Ivey      WR     6-0 / 190   0.9249      190          6       26   Commit 10/29/2020  
    5                                 King (Milwaukee, WI)         Jerry Cross      TE     6-6 / 218   0.9153      218          4        8    Commit 7/28/2020  
    6                         Northeast (Philadelphia, PA)          Ken Talley     WDE     6-3 / 230   0.9069      253          9       13     Commit 9/8/2020  
    7                              Central York (York, PA)        Beau Pribula    DUAL     6-2 / 215   0.8891      370         12        9     Commit 8/3/2020  
    8   The Williston Northampton School (Easthampton, MA)       Maleek McNeil      OT     6-8 / 340   0.8593      705          8       64     Commit 5/1/2021
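
One caveat about the loop above: `find_next` searches everything that follows the tag in the document, not only the tag's own children, so a row that lacks a field could silently pick up the value from a later row. Also, `.text` keeps the padding whitespace visible in the status column. The following is a minimal sketch of a stricter variant, not the answer's original method. It uses `find` (descendants only) and `get_text(strip=True)`. The `SAMPLE_HTML` markup, the player names, and the `text_of` and `parse_rows` helpers are made up for illustration; only the class names come from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup. The real page has more fields and nesting.
# The first row deliberately has no "position" element.
SAMPLE_HTML = """
<ul>
  <li class="ri-page__list-item">
    <a class="ri-page__name-link"> Player A </a>
    <p class="commit-date withDate"> Commit 7/25/2020  </p>
  </li>
  <li class="ri-page__list-item">
    <a class="ri-page__name-link"> Player B </a>
    <div class="position"> OT </div>
    <p class="commit-date withDate"> Commit 9/4/2020  </p>
  </li>
</ul>
"""


def text_of(row, tag_name, css_class):
    """Return the stripped text of the first matching descendant of `row`, or None."""
    node = row.find(tag_name, class_=css_class)  # searches inside `row` only
    return node.get_text(strip=True) if node is not None else None


def parse_rows(html):
    """Build one dict per list item, with None for fields a row does not contain."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.find_all("li", class_="ri-page__list-item"):
        rows.append(
            {
                "name": text_of(row, "a", "ri-page__name-link"),
                "position": text_of(row, "div", "position"),
                "status": text_of(row, "p", "commit-date"),
            }
        )
    return rows


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

With this sample, the first row reports `None` for the position instead of borrowing "OT" from the second row, and the status strings come back without leading or trailing spaces. The resulting list of dicts can be passed to `pd.DataFrame` exactly as in the answer.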