Tags: r, web-scraping, iframe, rvest

How do I scrape information from an `iframe` in R?


I am trying to scrape information from this website: https://www.cps.edu/schools/schoolprofiles/acero-santiago

In particular, I want to scrape the Supportive School designation found in the Reports tab. For this school the designation reads "Established", and that is the text I want to grab.

Here is my code so far:

library(rvest)

url <- "https://www.cps.edu/schools/schoolprofiles/acero-santiago"
page <- read_html(url)

# select the iframe element
iframe <- page %>% html_element("iframe")

iframe_src <- html_attr(iframe, "src")

iframe_page <- read_html(paste("https://www.cps.edu",iframe_src,sep=""))

But once I get here, I can't find a node to select, and I'm unable to scrape any information from the page. For example:

data <- iframe_page %>% html_node("h4") %>% html_text()

I don't get any results.

Any thoughts?


Solution

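  • The report text does not seem to be part of the HTML that `read_html()` retrieves, which would explain the empty result. If you search for the phrase "This school has put in place systems" in the Network tab of your browser's developer tools, you'll find the SchoolProgressReport API endpoint, which can be queried directly: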

    library(dplyr)
    library(rvest)
    library(stringr)
    
    url <- "https://www.cps.edu/schools/schoolprofiles/acero-santiago"
    
    progress_report <- read_html(url) %>% 
      html_element("iframe.iframe-page") %>% 
      html_attr("src") %>% 
      str_extract("SchoolId=\\d+$") %>% 
      paste0("https://www.cps.edu/api/schoolprofile/SchoolProgressReport?", .) %>% 
      jsonlite::read_json(simplifyVector = TRUE) %>% 
      tidyr::pivot_longer(everything())
    
    progress_report %>% 
      filter(str_detect(name, fixed("supportive_School_Award")))
    #> # A tibble: 2 × 2
    #>   name                         value                                            
    #>   <chr>                        <chr>                                            
    #> 1 supportive_School_Award      ESTABLISHED                                      
    #> 2 supportive_School_Award_Desc This school has put in place systems and structu…
    

    Created on 2023-02-22 with reprex v2.0.2
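
The two steps in the pipeline above (building the API URL from the iframe `src`, then picking out the award fields) can also be sketched offline as small helpers. This is only an illustration under assumptions: `build_report_url()` and `pick_award()` are hypothetical names, the example `src` path and the `hypothetical_field` column are made up, and base R (`regexpr()`/`regmatches()`, `grep()`) is swapped in for `stringr` and `tidyr` so the sketch runs without extra packages.

```r
# Build the API URL from an iframe src value.
# Base R (regexpr/regmatches) stands in for stringr::str_extract here.
build_report_url <- function(iframe_src) {
  match_pos <- regexpr("SchoolId=[0-9]+$", iframe_src)
  if (match_pos == -1) {
    # Fail loudly instead of silently building a URL without a school id.
    stop("No trailing SchoolId parameter found in iframe src: ", iframe_src)
  }
  school_query <- regmatches(iframe_src, match_pos)
  paste0("https://www.cps.edu/api/schoolprofile/SchoolProgressReport?", school_query)
}

# Keep only the award columns of a one-row data frame shaped like the parsed response.
pick_award <- function(report) {
  award_cols <- grep("^supportive_School_Award", names(report), value = TRUE)
  report[, award_cols, drop = FALSE]
}

# Hypothetical inputs for illustration only; real values come from the live page and API.
example_src <- "/hypothetical/path?SchoolId=123456"
example_report <- data.frame(
  hypothetical_field = "placeholder",
  supportive_School_Award = "ESTABLISHED",
  supportive_School_Award_Desc = "placeholder description",
  stringsAsFactors = FALSE
)

report_url <- build_report_url(example_src)
award <- pick_award(example_report)
print(report_url)
print(award)
```

With live data, the `src` would come from `html_attr("src")` and the report from `jsonlite::read_json(report_url, simplifyVector = TRUE)`, as in the answer above. Selecting the columns directly avoids `pivot_longer(everything())`, which requires all columns to share a compatible type.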