r, web-scraping, rvest

Getting the central table in a webpage


I am having difficulty scraping data from a website: https://scientific.sparx-ip.net/archiveeular/?c=s&view=2

I would like to get the central table with the abstracts, but when I run


library(rvest)
page <- read_html("https://scientific.sparx-ip.net/archiveeular/?c=s&view=2") 
page %>% html_table()

I only get a small, mostly empty table:

> page %>% html_table()
[[1]]
# A tibble: 4 x 2
  X1            X2
  <chr>      <dbl>
1 "version:"  1.02
2 ""         NA   
3 ""         NA   
4 ""         NA   

It looks like I am only getting the left sidebar this way (page %>% html_text() also returns only the left sidebar content).

I tried using sessions, but that did not help.

What am I doing wrong?


Solution

  • In this case, the URL copied from the address bar does not fully define the page you had open in your browser. Those list links (and other controls) generate two requests: the href https://scientific.sparx-ip.net/archiveeular/index.cfm?view=2&c=si triggers a view update on the server side but returns an empty response with a redirect to /archiveeular/?c=s&view=2, which in turn responds with the actual content. Accessing /archiveeular/?c=s&view=2 directly in a fresh session delivers a different page, since the view update has not been triggered for that session.

    So instead of read_html(), you'd want something that follows redirects, preferably automagically, and can simulate a browser session: rvest::session()
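
    As a rough way to check the redirect yourself, the hops that a session followed can be read from the underlying httr response. This is a minimal sketch, not part of the original reprex: redirect_chain() is a hypothetical helper, and it assumes that s$response is an httr response whose all_headers field holds one entry (with status and headers) per hop. The fake object below only mimics that structure so the example runs offline, and its 302 status code is an assumption about what the server sends.

    ```r
    # Hypothetical helper: summarise each hop recorded in an httr-style response.
    redirect_chain <- function(resp) {
      hops <- resp$all_headers
      data.frame(
        status = vapply(hops, function(h) as.integer(h$status), integer(1)),
        location = vapply(hops, function(h) {
          loc <- h$headers[["location"]]
          if (is.null(loc)) NA_character_ else loc
        }, character(1)),
        stringsAsFactors = FALSE
      )
    }

    # Offline stand-in for a response: one redirect, then the final page.
    # The 302 status is assumed for illustration only.
    fake <- list(all_headers = list(
      list(status = 302L, headers = list(location = "/archiveeular/?c=s&view=2")),
      list(status = 200L, headers = list())
    ))
    chain <- redirect_chain(fake)
    chain
    ```

    With a live session s, the equivalent call would be redirect_chain(s$response); more than one row suggests that at least one redirect was followed.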

    To get to the table and increase the number of items shown per page, we first need to step through a series of URLs while keeping the same session. Parsing a session() result works the same as parsing a read_html() result:

    library(rvest)
    library(dplyr, warn.conflicts = FALSE)
    
    # run in session: open page, jump to "Abstract Titles", set page size to 100
    s <- session("https://scientific.sparx-ip.net/archiveeular/index.cfm") |>
      session_jump_to("https://scientific.sparx-ip.net/archiveeular/index.cfm?view=2&c=si") |>
      session_jump_to("https://scientific.sparx-ip.net/archiveeular/calc.cfm?pagesize=100")
    
    # extract page count 
    page_count <- 
      s |>
      html_element("div.items-found > b:last-of-type") |> 
      html_text() |> 
      as.integer()
    page_count
    #> [1] 37
    
    # extract table, handle duplicate column names and drop empty column
    get_table <- function(s) {
      html_element(s, "table.table-result") |> 
        html_table() |>
        setNames(c("item", "title", "blank")) |>
        select(item, title)
    }
    # navigate to a new page
    jump_page <- function(s, n){
      session_jump_to(s, paste0("https://scientific.sparx-ip.net/archiveeular/calc.cfm?page=",n))
    }
    
    # list allocation for tables
    tables <- vector(mode = "list", length = page_count)
    
    # table from the current (1st) page
    tables[[1]] <- get_table(s)
    
    # collect tables from next 4 pages
    for (n in 2:5){
      message("Page ", n)
      s <- jump_page(s, n)
      tables[[n]] <- get_table(s)
    }
    #> Page 2
    #> Page 3
    #> Page 4
    #> Page 5
    
    # concatenate tables from the list (unfilled NULL slots are dropped)
    bind_rows(tables)
    

    Result:

    #> # A tibble: 500 × 2
    #>    item              title                                                      
    #>    <chr>             <chr>                                                      
    #>  1 2023 POS0772      ´´EPIDEMIOLOGY OF JUVENILE IDIOPATHIC ARTHRITIS IN ARGENTI…
    #>  2 2023 POS0147      αVβ3 INTEGRIN AS A LINKER BETWEEN FIBROSIS AND THYROID HOR…
    #>  3 2023 OP0275-HPR   ‘IT’S A LOT TO TAKE IN’: A SYSTEMATIC REVIEW OF THE INFORM…
    #>  4 2023 POS0175      “DO DISEASE MODIFYING ANTIRHEUMATIC DRUGS INFLUENCE THE FR…
    #>  5 2023 AB1732-PARE  “FLARE, DID YOU SAY FLARE?” FLARES IN SJÖGREN’S DISEASE: T…
    #>  6 2023 POS1447      “IF I HAVE SJÖGREN’S SYNDROME, I WANT TO KNOW IT AS EARLY …
    #>  7 2023 AB0201       “IT SURPRISED ME A LOT THAT THERE IS A LINK”: A QUALITATIV…
    #>  8 2023 POS0788-HPR  “IT’S LIKE LISTENING TO THE RADIO WITH A LITTLE INTERFEREN…
    #>  9 2023 POS0201-PARE “MOOD IS HAPPY AND DOWNRIGHT WILD” - HEALTH PROMOTION AND …
    #> 10 2023 POS1585-HPR  “SO, MEN WILL BE ABLE TO RECEIVE #METHOTREXATE FOR LUPUS A…
    #> # ℹ 490 more rows
    

    Created on 2024-01-17 with reprex v2.0.2
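
    The loop above stops at page 5. To walk every page, the same steps could be wrapped in a small helper. This is a minimal sketch under stated assumptions: collect_pages() is a hypothetical name, the pause between requests is an assumed courtesy delay rather than something the site documents, and the fake_* functions exist only so the example runs offline.

    ```r
    library(dplyr, warn.conflicts = FALSE)

    # Hypothetical helper: visit pages 1..n_pages and extract one table per page.
    # `jump` and `extract` are passed in, so jump_page() and get_table() from
    # the answer can be supplied unchanged.
    collect_pages <- function(s, n_pages, jump, extract, pause = 1) {
      tables <- vector(mode = "list", length = n_pages)
      tables[[1]] <- extract(s)           # the session already shows page 1
      for (n in seq_len(n_pages)[-1]) {   # empty sequence when n_pages is 1
        Sys.sleep(pause)                  # assumed courtesy delay between requests
        s <- jump(s, n)
        tables[[n]] <- extract(s)
      }
      bind_rows(tables)
    }

    # Offline check with stand-ins: the "session" is just a page number.
    fake_jump <- function(s, n) n
    fake_extract <- function(s) {
      tibble(item = paste("item", s), title = paste("title", s))
    }
    demo <- collect_pages(1, 3, fake_jump, fake_extract, pause = 0)
    ```

    Against the real site, the call would be collect_pages(s, page_count, jump_page, get_table), using the objects defined in the reprex above.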