r, web-scraping, beautifulsoup, css-selectors, rvest

Scraping With CSS Returning NULL in R & Python


I'm trying to scrape all signed football players from a state (in this case Alabama) in a given year (in this case 2005). This is the website link for that information: https://247sports.com/Season/2005-Football/Recruits/?&Player.Hometown.State.Key=sa_2&RecruitInterestEvents.Type=6

To start, I just want to scrape the names. I've scraped successfully with this code before (on another page), but this time the selector that SelectorGadget gives me is ".name a", and when I use it in my code, in both R and Python, I get nothing back.

R Code

library(rvest)  # provides read_html(), html_node(), html_nodes(), html_text() and %>%

# pull in website by year and state, testing with Alabama in 2005
web_link2 <- "https://247sports.com/Season/2005-Football/Recruits/?&Player.Hometown.State.Key=sa_2&RecruitInterestEvents.Type=6"
web247_in2 <- read_html(web_link2)

# pull the body of the html site
web_body2 <- web247_in2 %>%
  html_node("body") %>%
  html_children()

# pull out all data from website by variable & clean up
commit_names2 <- html_nodes(web_body2, '.name a') %>%
  html_text() %>%
  as.data.frame()

Python Code

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://247sports.com/Season/2005-Football/Recruits/?&Player.Hometown.State.Key=sa_2&RecruitInterestEvents.Type=6"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/573.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, features="html.parser")

commit_names = soup.select(".name a")
print(commit_names)

To make matters worse, this page uses infinite scrolling: more rows load as you scroll down. I plan to tackle that next, once I can get this first pull to work.

Here's an example of another page on this site that I was able to scrape successfully with the same code.

Successful R Scrape Example

web_link <- "https://247sports.com/Season/2005-Football/Commits/?RecruitState=AL"
web247_in <- read_html(web_link)

# pull the body of the html site
web_body <- web247_in %>%
  html_node("body") %>%
  html_children()

# pull out all data from website by variable & clean up
commit_names <- html_nodes(web_body, '.ri-page__name-link') %>%
  html_text() %>%
  as.data.frame()

Successful Python Scrape Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://247sports.com/Season/2010-Football/Commits/?RecruitState=AL"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/573.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, features="html.parser")

commit_names = soup.select(".ri-page__name-link")
print(commit_names)

My preference is R, but I'll take whatever I can get at this point. Can anyone shed some light on what I'm missing here? The only things that change are the CSS selector and the page itself, yet the code just isn't pulling the data.

Thanks for your help!!!


Solution

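  • The table's contents are loaded dynamically by a separate request after the page itself has loaded, so they are not in the HTML that read_html() or requests.get() receives. That's why the selector finds nothing when you scrape the page that way.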

    If you right-click the page, choose 'Inspect element', open the 'Network' tab and refresh the page, you will see an XHR (data) request being made to

    https://247sports.com/Season/2005-Football/Recruits.json?&Items=15&Page=1&Player.Hometown.State.Key=sa_2&RecruitInterestEvents.Type=6

    This request returns a JSON response containing the information you want.
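
    Since the question also included Python, here is a minimal sketch of the same idea using only the standard library. It is illustrative rather than definitive: the function names are made up for this example, the record shape (a list of objects, each with a "Player" object holding "FullName") is inferred from the columns the endpoint returns, and the paging loop assumes that increasing Page returns the next batch and that an empty list marks the end. Neither paging assumption has been confirmed against the site.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the XHR request seen in the Network tab.
BASE_URL = "https://247sports.com/Season/2005-Football/Recruits.json"
# Assumption: the site may reject requests without a browser-like User-Agent.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_url(page=1, items=15):
    """Rebuild the XHR URL with the same query parameters, varying Page/Items."""
    params = {
        "Items": items,
        "Page": page,
        "Player.Hometown.State.Key": "sa_2",
        "RecruitInterestEvents.Type": 6,
    }
    return BASE_URL + "?" + urlencode(params)


def fetch_page(page, items=15):
    """Download one page of results and parse the JSON body (network call)."""
    request = Request(build_url(page, items), headers=HEADERS)
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_names(records):
    """Return Player.FullName for every record that has one."""
    names = []
    for record in records:
        player = record.get("Player") or {}
        if "FullName" in player:
            names.append(player["FullName"])
    return names


def collect_names(fetch=fetch_page, max_pages=50):
    """Walk Page=1, 2, ... until a page comes back empty (assumed end marker)."""
    names = []
    for page in range(1, max_pages + 1):
        records = fetch(page)
        if not records:
            break
        names.extend(extract_names(records))
    return names


# Example (performs real HTTP requests, so it is left commented out):
# print(collect_names(max_pages=1))
```

    Passing the fetch function as an argument keeps the paging logic separate from the network call, so it can be tried on saved JSON first. If the paging assumption holds, this would also cover the scrolling part of the question, since each scroll step presumably triggers the same request with a higher Page value.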

    Some R code to load this with jsonlite and flatten it with tidyr::unnest_wider (see the tidyr rectangling vignette, vignette("rectangle"), for help on that function):

    library(jsonlite)
    library(tidyr)  # provides tibble(), unnest_wider() and %>%
    
    url <- "https://247sports.com/Season/2005-Football/Recruits.json?&Items=15&Page=1&Player.Hometown.State.Key=sa_2&RecruitInterestEvents.Type=6"
    res <- read_json(url)
    
    tibble(res = res) %>% 
      unnest_wider(res) %>% 
      unnest_wider(Player, names_sep = "_")
    

    This gives a tibble containing the player information:

    # A tibble: 15 x 50
         Key Player_Key Player_Hometown Player_FirstName Player_LastName Player_FullName Player_Height Player_Weight Player_Bio
       <int>      <int> <list>          <chr>            <chr>           <chr>           <chr>                 <dbl> <chr>     
     1 27063      25689 <named list [2… Chris            Keys            Chris Keys      6-2                     215 Chris Key…
     2 44079      41761 <named list [2… Tommy            Trott           Tommy Trott     6-4                     235 Tommy Tro…
     3 44073      41755 <named list [2… Rex              Sharpe          Rex Sharpe      6-3                     215 Rex Sharp…
     4 44053      41735 <named list [2… Gabe             McKenzie        Gabe McKenzie   6-3                     218 Gabe McKe…
     5 44015      41697 <named list [2… Montez           Billings        Montez Billings 6-2                     175 Montez Bi…
     6 44241      41921 <named list [2… Bobby            Greenwood       Bobby Greenwood 6-4                     239 Bobby Gre…