
Web scraping with R and RSelenium -- difficulty with drop down menu and select


I'm trying to access files from this website: https://public.education.mn.gov/MDEAnalytics/DataTopic.jsp?TOPICID=11. I want the level set to County, and I want to do this for each year. For the sake of this example, assume I only want 2022. I got RSelenium up and running, but none of my attempts to locate the select menu elements has worked.

For instance:

remDr <- remote_driver$client
remDr$open()
remDr$navigate("https://public.education.mn.gov/MDEAnalytics/DataTopic.jsp?TOPICID=11")
data_table <- remDr$findElement(using = 'id', value = "cmbCOLuMN")

returns an error: "An element could not be located on the page using the given search parameters".

I've tried changing the using and value arguments of findElement(), still with no luck. I would be grateful for any insight into how to set the level to County and the year to 2022.

Update: I made more progress with the code below, based on a previous Stack Overflow answer about iframes, but I'm still stuck at the last step:

remDr$open()
remDr$navigate("https://public.education.mn.gov/MDEAnalytics/DataTopic.jsp?TOPICID=11") 
frames <- remDr$findElements('css', "iframe")
remDr$switchToFrame(frames[[1]])
selectElem <- remDr$findElement("id", "cmbCOLUMN1")
selectOpt <- selectElem$selectTag()

I can't work out how to use selectOpt to choose the value I want; I expected something like selectOpt$text$County to work.


Solution

  • It looks like there's a second iframe. I ran your code up to the first frame switch:

    frames <- remDr$findElements('css', "iframe")
    remDr$switchToFrame(frames[[1]])
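
    If you still want to set the dropdown before listing the files, here is a minimal sketch. It assumes your RSelenium version provides selectTag() on web elements (returning a list with elements, text, value and selected), that cmbCOLUMN1 is the Level dropdown as in your code, and that one option is labelled exactly "County". option_index() and select_by_text() are hypothetical helper names, not part of RSelenium.

    ```r
    # Hypothetical helper: position of the option whose visible text equals `label`.
    option_index <- function(option_text, label) {
      idx <- which(trimws(option_text) == label)
      if (length(idx) == 0) {
        stop("No option labelled '", label, "'; available: ",
             paste(trimws(option_text), collapse = ", "))
      }
      idx[[1]]
    }

    # Hypothetical helper: click the matching <option> inside a <select> element.
    select_by_text <- function(select_elem, label) {
      opts <- select_elem$selectTag()  # list with $elements, $text, $value, $selected
      opts$elements[[option_index(opts$text, label)]]$clickElement()
      invisible(opts)
    }

    # Usage, after switching into the first iframe (the label is an assumption):
    # select_by_text(remDr$findElement("id", "cmbCOLUMN1"), "County")
    ```

    selectOpt$text is a plain character vector of option labels, which is why selectOpt$text$County fails; the clickable objects live in selectOpt$elements at the same positions.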
    

    Then I clicked the "list files" button using its HTML id.

    
    # click on a button ------------------------------------
    remDr$findElement(using = "id", value = "button1")$clickElement()
    

    Clicking that button shows all the available files in a second iframe, so I found the second iframe by its HTML id (#report) and switched to it.

    # switch to iframe 2 ------------------------------------
    report_frame <- remDr$findElement(using = "id", value = "report")
    remDr$switchToFrame(report_frame)
    
    

    Then I pulled the page's HTML and scanned it for tables.

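    library(rvest)  # provides read_html(), html_table() and the %>% pipe

    # Pull the HTML of the current frame and parse it
    page_html <- remDr$getPageSource()[[1]] %>%
      read_html()

    # Extract every <table> as a data frame
    tables <- page_html %>% html_table()

    # The second table holds the file listing
    files_table <- tables[[2]]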
    

    I'm assuming this is what you wanted? A data frame with a list of all the available files:

    # A tibble: 1,458 × 6
       X1       X2       X3     X4                                     X5     X6   
       <chr>    <chr>    <chr>  <chr>                                  <chr>  <chr>
     1 ""       ""       ""     ""                                     ""     ""   
     2 "Level"  "Name"   "Year" "Document"                             "Data… "Hel…
     3 "County" "Aitkin" "2022" "2022 Minnesota Student Survey County" "pdf"  ""   
     4 "County" "Aitkin" "2019" "2019 Minnesota Student Survey County" "pdf"  ""   
     5 "County" "Aitkin" "2016" "2016 Minnesota Student Survey County" "pdf"  ""   
     6 "County" "Aitkin" "2013" "2013 Minnesota Student Survey County" "pdf"  ""   
     7 "County" "Anoka"  "2022" "2022 Minnesota Student Survey County" "pdf"  ""   
     8 "County" "Anoka"  "2019" "2019 Minnesota Student Survey County" "pdf"  ""   
     9 "County" "Anoka"  "2016" "2016 Minnesota Student Survey County" "pdf"  ""   
    10 "County" "Anoka"  "2013" "2013 Minnesota Student Survey County" "pdf"  ""   
    # … with 1,448 more rows
    # ℹ Use `print(n = ...)` to see more rows
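
    Since the listing comes back with every level and year, one way to get only the County files for 2022 is to filter this table afterwards. The sketch below uses a small stand-in for files_table built from the first rows printed above (only the four columns whose headers are fully visible), promotes the "Level"/"Name"/... row to column names, and keeps the matching rows. The names files and county_2022 are just illustrative.

    ```r
    # Stand-in for files_table, copied from the first rows of the printed output
    files_table <- data.frame(
      X1 = c("", "Level", "County", "County", "County", "County"),
      X2 = c("", "Name", "Aitkin", "Aitkin", "Anoka", "Anoka"),
      X3 = c("", "Year", "2022", "2019", "2022", "2019"),
      X4 = c("", "Document",
             "2022 Minnesota Student Survey County",
             "2019 Minnesota Student Survey County",
             "2022 Minnesota Student Survey County",
             "2019 Minnesota Student Survey County"),
      stringsAsFactors = FALSE
    )

    # Locate the row that actually holds the column headers
    header_row <- which(files_table$X1 == "Level")[1]

    # Drop everything up to and including the header row, then rename the columns
    files <- files_table[-seq_len(header_row), ]
    names(files) <- unlist(files_table[header_row, ], use.names = FALSE)

    # Year was read as character, so compare against the string "2022"
    county_2022 <- files[files$Level == "County" & files$Year == "2022", ]
    print(county_2022)
    ```

    With the real files_table from the scrape, skip the stand-in data frame and run the remaining lines as they are.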