r, web-scraping, css-selectors, rvest

Why does using :first/:last in a CSS selector return an error in rvest?


I'm trying to scrape a page that has a few buttons, and I want to select and click the last one. Using Chrome's SelectorGadget extension, I can successfully select the last button by adding :last to the end of my selector. But when I run the following functions in rvest, they return: Error in onRejected(reason) : code: -32000 message: DOM Error while querying

Here is the code:

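library(rvest)

# Open the page in a live (headless Chrome) session
page <- 
  read_html_live("https://researchers.cedars-sinai.edu/search?by=text&type=user")

# Selecting the last button fails with the DOM error
page %>% 
  html_elements("span button:last") 

# or

# Clicking it fails the same way
page$click(css = "span button:last")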

I have also tried these alternatives, but they don't do the job: :nth-child(1), :first-child, and :nth-last-child(1).

Also, I know XPath can solve the problem. The issue is that rvest's click() does not accept XPath yet, so I have to stick with CSS.


Solution
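
On the error itself: :last is a jQuery extension rather than a standard CSS pseudo-class, so a jQuery-based tool can accept it while the browser's native selector engine rejects it. The standard pseudo-classes (:last-child, :last-of-type and so on) are evaluated per parent, which may be why they match more than one button. The sketch below illustrates this on a toy document; the markup is hypothetical and is not taken from the real page, so the final selector is only an illustration and may not carry over as-is.

```r
library(rvest)

# Toy document (hypothetical markup, not the real page):
# two <span> parents, each holding two buttons.
doc <- minimal_html('
  <span><button>1</button><button>2</button></span>
  <span><button>3</button><button>4</button></span>
')

# `:last` is not standard CSS, so the query is expected to fail;
# tryCatch keeps the script running either way.
res <- tryCatch(
  html_elements(doc, "span button:last"),
  error = function(e) conditionMessage(e)
)

# Standard pseudo-classes work per parent: this matches the last
# button inside *each* span, not the last button in the document.
per_parent <- html_text(html_elements(doc, "span button:last-of-type"))
per_parent
# "2" "4"

# Constraining the parent as well narrows the match to one element.
last_overall <- html_text(
  html_elements(doc, "span:last-of-type button:last-of-type")
)
last_overall
# "4"
```

If the real page nests its buttons differently, the parent part of the selector would need to be adapted to that structure.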

  • Instead of clicking through the page, you can query the site's API directly and fetch all the records.

    library(tidyverse)
    library(httr2)
    
    # First request: only used to read the total number of records.
    # req_body_json() attaches a JSON body, so httr2 sends a POST.
    req <- request("https://researchers.cedars-sinai.edu/api/users") %>% 
      req_body_json(list(params = list(by = "text", type = "user"))) %>% 
      req_perform() %>% 
      resp_body_json(simplifyVector = TRUE) 
    
    # Total number of users reported by the API
    n <- req %>% 
      pluck("pagination", "total")
    
    # Request the records 100 at a time and row-bind the pages
    df <- map(seq(0, n, 100), 
        ~ request("https://researchers.cedars-sinai.edu/api/users") %>% 
          req_body_json(list(params = list(by = "text", type = "user"), 
                             pagination = list(startFrom = .x, perPage = 100))) %>% 
          req_perform() %>% 
          resp_body_json(simplifyVector = TRUE) %>% 
          pluck("resource") %>% 
          as_tibble()) %>% 
      list_rbind()
    
    # A tibble: 986 × 16
       lastName    overview  hasThumbnail discoveryUrlId positions tags$explicit discoveryId linkedObjectsCounts$…¹
       <chr>       <chr>     <lgl>        <chr>          <list>    <list>        <chr>                        <int>
     1 Abdel-Hafiz "Hany Ab… TRUE         Hany.Abdel-Ha… <df>      <df [1 × 3]>  1513                             1
     2 Abdul-Haqq   NA       TRUE         Ryan.Abdul     <df>      <NULL>        3636                             0
     3 Aboujaoude   NA       TRUE         Elias.Aboujao… <df>      <NULL>        10472                            0
     4 Abuav       "Dr. Abu… TRUE         Rachel.Abuav   <df>      <NULL>        4847                             0
     5 Accortt     "Eynav A… TRUE         Eynav.Accortt  <df>      <df [8 × 3]>  1865                             8
     6 Ader        "The ove… TRUE         Marilyn.Ader   <df>      <df [8 × 3]>  1237                            13
     7 Ahdoot       NA       TRUE         Michael.Ahdoot <df>      <df [1 × 3]>  877                             10
     8 Ahluwalia    NA       TRUE         Sonu.Ahluwalia <df>      <df [1 × 3]>  2958                             0
     9 Ahmed        NA       FALSE        Waseem.Ahmed   <df>      <NULL>        18202                            0
    10 Ainsworth    NA       TRUE         Richard.Ainsw… <df>      <NULL>        7154                             3
    # ℹ 976 more rows
    # ℹ abbreviated name: ¹​linkedObjectsCounts$grants$all
    # ℹ 13 more variables: linkedObjectsCounts$grants$favourites <int>,
    #   linkedObjectsCounts$teachingActivities <df[,2]>, $equipment <df[,2]>, $professionalActivities <df[,2]>,
    #   $publications <df[,2]>, firstName <chr>, firstNameLastName <chr>, equipmentLinkTypes <list>,
    #   objectId <int>, updatedWhen <chr>, hasCollaborationData <lgl>, embeddableMediaList <list>,
    #   customFilterOne <list>
    # ℹ Use `print(n = ...)` to see more rows