I'm trying to scrape all signed football players from a state (in this case Alabama) in a given year (in this case 2005). This is the website link for that information: https://247sports.com/Season/2005-Football/Recruits/?&Player.Hometown.State.Key=sa_2&RecruitInterestEvents.Type=6
To start, I just want to scrape the names. This code has worked for me on another page of this site, but this time the selector I get from Selector Gadget is ".name a", and when I use it in my code, in either R or Python, I get nothing back.
R Code
library(rvest)
#pull in website by year and state, testing with Alabama in 2005
web_link2 <- "https://247sports.com/Season/2005-Football/Recruits/?&Player.Hometown.State.Key=sa_2&RecruitInterestEvents.Type=6"
web247_in2 <- read_html(web_link2)
#pull the body of the html site
web_body2 <- web247_in2 %>%
  html_node("body") %>%
  html_children()
#pull out all data from website by variable & clean up
commit_names2 <- html_nodes(web_body2, '.name a') %>%
  html_text() %>%
  as.data.frame()
Python Code
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://247sports.com/Season/2005-Football/Recruits/?&Player.Hometown.State.Key=sa_2&RecruitInterestEvents.Type=6"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/573.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, features="html.parser")
commit_names = soup.select(".name a")
print(commit_names)
To make matters worse, this page is a scrolling page that displays more information as you scroll. I plan to tackle that next once I can get this to pull successfully.
Here's an example of another page on this site that I was able to scrape successfully with the same code.
Successful R Scrape Example
library(rvest)
web_link <- "https://247sports.com/Season/2005-Football/Commits/?RecruitState=AL"
web247_in <- read_html(web_link)
#pull the body of the html site
web_body <- web247_in %>%
  html_node("body") %>%
  html_children()
#pull out all data from website by variable & clean up
commit_names <- html_nodes(web_body, '.ri-page__name-link') %>%
  html_text() %>%
  as.data.frame()
Successful Python Scrape Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://247sports.com/Season/2010-Football/Commits/?RecruitState=AL"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/573.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, features="html.parser")
commit_names = soup.select(".ri-page__name-link")
print(commit_names)
My preference is R, but I'll take whatever I can get at this point. Can anyone shed some light on what I'm missing here? The only thing that seems to change is the CSS value for scraping and the actual page - but it just isn't pulling the data.
Thanks for your help!!!
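The table's contents are loaded dynamically by JavaScript after the page loads, so they are not in the HTML that read_html or requests.get returns. That is why the selector finds nothing when you scrape it that way.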
If you right-click the page, click 'Inspect element', go to the 'Network' tab and refresh the page, you will see an XHR (data) request being made to a Recruits.json URL (the one used in the code below).
This request returns a JSON containing the information you want.
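To illustrate how the JSON URL relates to the page URL from the question, here is a minimal Python sketch. The helper name to_json_url is hypothetical, and the Items and Page parameters are simply copied from the request seen in the Network tab; whether the server accepts other values for them is an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_json_url(page_url, items=15, page=1):
    """Turn a .../Recruits/?filters page URL into the .../Recruits.json?... form."""
    parts = urlsplit(page_url)
    # Drop the trailing slash and add the .json suffix seen in the XHR request.
    json_path = parts.path.rstrip("/") + ".json"
    # Keep the page's own filters; the stray leading "&" is ignored by parse_qsl.
    filters = parse_qsl(parts.query)
    # Items and Page are copied from the observed request (assumed paging controls).
    query = urlencode([("Items", items), ("Page", page)] + filters)
    return urlunsplit((parts.scheme, parts.netloc, json_path, query, ""))
```

Passing the question's page URL should give the same address as the observed request, apart from the stray "&" right after the "?".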
Some R code to load this using jsonlite and parse it using tidyr::unnest_wider (see the tidyr rectangling vignette for help on that function):
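library(jsonlite)  # read_json()
library(tibble)    # tibble()
library(tidyr)     # unnest_wider() and the %>% pipe
# JSON endpoint behind the recruits table (page 1, 15 items)
url <- "https://247sports.com/Season/2005-Football/Recruits.json?&Items=15&Page=1&Player.Hometown.State.Key=sa_2&RecruitInterestEvents.Type=6"
res <- read_json(url)
# one row per recruit, then spread the nested Player list into Player_* columns
tibble(res = res) %>%
  unnest_wider(res) %>%
  unnest_wider(Player, names_sep = "_")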
Which gives a tibble containing the player information:
# A tibble: 15 x 50
    Key Player_Key Player_Hometown Player_FirstName Player_LastName Player_FullName Player_Height Player_Weight Player_Bio
  <int>      <int> <list>          <chr>            <chr>           <chr>           <chr>                 <dbl> <chr>
1 27063      25689 <named list [2… Chris            Keys            Chris Keys      6-2                     215 Chris Key…
2 44079      41761 <named list [2… Tommy            Trott           Tommy Trott     6-4                     235 Tommy Tro…
3 44073      41755 <named list [2… Rex              Sharpe          Rex Sharpe      6-3                     215 Rex Sharp…
4 44053      41735 <named list [2… Gabe             McKenzie        Gabe McKenzie   6-3                     218 Gabe McKe…
5 44015      41697 <named list [2… Montez           Billings        Montez Billings 6-2                     175 Montez Bi…
6 44241      41921 <named list [2… Bobby            Greenwood       Bobby Greenwood 6-4                     239 Bobby Gre…
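Since the question also asks about Python and about the scrolling behaviour, here is a rough Python sketch of the same idea using only the standard library. It requests the JSON URL from the R example with an increasing Page value and collects Player.FullName from each record. The function names are hypothetical, and stopping when a page comes back empty is an assumption about how the endpoint behaves past the last page, so max_pages is there as a safety cap.

```python
import json
import urllib.request

# URL taken from the R example above; {page} is filled in per request.
BASE_URL = (
    "https://247sports.com/Season/2005-Football/Recruits.json"
    "?&Items=15&Page={page}"
    "&Player.Hometown.State.Key=sa_2&RecruitInterestEvents.Type=6"
)
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_page(page):
    """Download one page of results and return the decoded JSON list."""
    request = urllib.request.Request(BASE_URL.format(page=page), headers=HEADERS)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_names(records):
    """Pull Player.FullName out of each record, skipping records without one."""
    names = []
    for record in records:
        player = record.get("Player") or {}
        if player.get("FullName"):
            names.append(player["FullName"])
    return names


def fetch_all_names(fetch=fetch_page, max_pages=50):
    """Request pages 1, 2, ... until one comes back empty (assumed end marker)."""
    names = []
    for page in range(1, max_pages + 1):
        records = fetch(page)
        if not records:
            break
        names.extend(extract_names(records))
    return names
```

Calling fetch_all_names() with no arguments would query the live site. The fetch parameter exists so the paging loop can be tried against canned data first.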