I have difficulties scraping data from a website: https://scientific.sparx-ip.net/archiveeular/?c=s&view=2
I would like to get the central table with the abstracts, but if I run
library(rvest)
page <- read_html("https://scientific.sparx-ip.net/archiveeular/?c=s&view=2")
page %>% html_table()
I only get a small empty table.
> page %>% html_table()
[[1]]
# A tibble: 4 x 2
X1 X2
<chr> <dbl>
1 "version:" 1.02
2 "" NA
3 "" NA
4 "" NA
It looks like I am only getting the left sidebar this way (page %>% html_text() also returns only the left sidebar content).
I tried using sessions, but that did not help.
What am I doing wrong?
In this case, the URL copied from the address bar does not fully define the page you had open in your browser. The list links (and other controls) generate two requests: the href https://scientific.sparx-ip.net/archiveeular/index.cfm?view=2&c=si triggers a view update on the server side but returns an empty response with a redirect to /archiveeular/?c=s&view=2, which in turn responds with the actual content. Accessing /archiveeular/?c=s&view=2 directly, in a fresh session, delivers a different page.
So instead of read_html() you want something that follows redirects automatically and can simulate a browser session: rvest::session().
To get to the table and increase the number of items shown per page, we first need to jump through a series of URLs while keeping the session. Parsing a session() result works the same way as parsing a read_html() result:
library(rvest)
library(dplyr, warn.conflicts = FALSE)
# run in session: open page, jump to "Abstract Titles", set page size to 100
s <- session("https://scientific.sparx-ip.net/archiveeular/index.cfm") |>
  session_jump_to("https://scientific.sparx-ip.net/archiveeular/index.cfm?view=2&c=si") |>
  session_jump_to("https://scientific.sparx-ip.net/archiveeular/calc.cfm?pagesize=100")
# extract page count
page_count <-
  s |>
  html_element("div.items-found > b:last-of-type") |>
  html_text() |>
  as.integer()
page_count
#> [1] 37
# extract table, handle duplicate column names and drop empty column
get_table <- function(s) {
  html_element(s, "table.table-result") |>
    html_table() |>
    setNames(c("item", "title", "blank")) |>
    select(item, title)
}
# navigate to a new page
jump_page <- function(s, n) {
  session_jump_to(s, paste0("https://scientific.sparx-ip.net/archiveeular/calc.cfm?page=", n))
}
# list allocation for tables
tables <- vector(mode = "list", length = page_count)
# table from current(1st) page
tables[[1]] <- get_table(s)
# collect tables from next 4 pages
for (n in 2:5) {
  message("Page ", n)
  s <- jump_page(s, n)
  tables[[n]] <- get_table(s)
}
#> Page 2
#> Page 3
#> Page 4
#> Page 5
# concat tables from the list
bind_rows(tables)
Result:
#> # A tibble: 500 × 2
#> item title
#> <chr> <chr>
#> 1 2023 POS0772 ´´EPIDEMIOLOGY OF JUVENILE IDIOPATHIC ARTHRITIS IN ARGENTI…
#> 2 2023 POS0147 αVβ3 INTEGRIN AS A LINKER BETWEEN FIBROSIS AND THYROID HOR…
#> 3 2023 OP0275-HPR ‘IT’S A LOT TO TAKE IN’: A SYSTEMATIC REVIEW OF THE INFORM…
#> 4 2023 POS0175 “DO DISEASE MODIFYING ANTIRHEUMATIC DRUGS INFLUENCE THE FR…
#> 5 2023 AB1732-PARE “FLARE, DID YOU SAY FLARE?” FLARES IN SJÖGREN’S DISEASE: T…
#> 6 2023 POS1447 “IF I HAVE SJÖGREN’S SYNDROME, I WANT TO KNOW IT AS EARLY …
#> 7 2023 AB0201 “IT SURPRISED ME A LOT THAT THERE IS A LINK”: A QUALITATIV…
#> 8 2023 POS0788-HPR “IT’S LIKE LISTENING TO THE RADIO WITH A LITTLE INTERFEREN…
#> 9 2023 POS0201-PARE “MOOD IS HAPPY AND DOWNRIGHT WILD” - HEALTH PROMOTION AND …
#> 10 2023 POS1585-HPR “SO, MEN WILL BE ABLE TO RECEIVE #METHOTREXATE FOR LUPUS A…
#> # ℹ 490 more rows
Created on 2024-01-17 with reprex v2.0.2
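The loop above stops at page 5. To collect every page, the same steps can be wrapped in a small helper. This is only a sketch: collect_pages() and its pause argument are hypothetical names introduced here, not part of rvest, and the one-second delay is an assumption about what counts as polite for this server. The table extractor and the page-jump function are passed in as arguments, so the loop logic can be tried with stand-in functions before touching the site.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not part of rvest): walk pages 1..page_count while
# reusing a single session, and bind the per-page tables into one tibble.
# `get_table` and `jump_page` are supplied by the caller.
collect_pages <- function(s, page_count, get_table, jump_page, pause = 1) {
  tables <- vector(mode = "list", length = page_count)
  # the session is assumed to be on page 1 already
  tables[[1]] <- get_table(s)
  # seq_len(page_count)[-1] is empty when there is only one page
  for (n in seq_len(page_count)[-1]) {
    Sys.sleep(pause)      # assumed delay between requests
    s <- jump_page(s, n)  # keep the updated session for the next step
    tables[[n]] <- get_table(s)
  }
  bind_rows(tables)
}
```

With the objects defined earlier, a call would look like collect_pages(s, page_count, get_table, jump_page). Reassigning s inside the loop matters, because each jump returns a new session object carrying the updated state.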