r, web-scraping, rvest, rselenium

Trouble scraping a webpage with rvest: read_html returns nearly empty objects


I'm new to web scraping, so this may be basic, but I'm trying to scrape this page: https://www.europol.europa.eu/media-press/newsroom.

The page looks ordinary, but although read_html succeeds, the resulting object is almost empty (lists nested inside lists, with no text at the end).

This may be because the page is interactive, so I tried RSelenium to get around the issue, but I still struggle to read the HTML correctly.

I've tried a CSS selector, an XPath, and html_nodes; whatever I do, I always get the same result: character(0).

My code looks like this.

read_html("https://www.europol.europa.eu/media-press/newsroom") %>% 
  html_nodes("div.content-wrapper") %>%
  html_attr("href")

I've also tried html_elements() with the CSS selector, and many other selectors in html_nodes() such as "h3 a" and "a"; it always returns character(0). Inspecting the object closely, it appears to be lists nested in lists that end in empty objects.


Solution

  • The page is dynamic and almost entirely rendered by JavaScript, but the list of news ids and the first few titles are embedded in the page source, inside a script tag. The script's text can be extracted with rvest and, after some light string manipulation, parsed as JSON. The first titles can be read directly from the resulting JSON; to fetch the rest, we can pass the news ids to API requests:

    library(dplyr)
    library(rvest)
    library(stringr)
    html <- read_html("https://www.europol.europa.eu/media-press/newsroom")
    server_data <- html %>%
      html_elements(xpath = "//head/script[contains(text(),'window.SERVER_DATA')]") %>%
      html_text() %>%
      # extract everything between "window.SERVER_DATA=" and the final ";" and
      # parse as JSON
      str_replace("window.SERVER_DATA=(.*);", "\\1") %>%
      jsonlite::parse_json(simplifyVector = TRUE)
    
    # first 12 titles, embedded in window.SERVER_DATA
    server_data$NodeLoader$node$lists$items[[1]] %>% 
      as_tibble()
    #> # A tibble: 12 × 6
    #>       id type  title                               published alias mainImage$alt
    #>    <int> <chr> <chr>                                   <int> <chr> <chr>        
    #>  1  5442 news  "Cocaine cartel uncovered on SKY E…    1.68e9 /med… pic (1).jpg  
    #>  2  5441 news  "31 migrant smugglers arrested in …    1.68e9 /med… AustrianOak2…
    #>  3  5440 news  "Europol Executive Director Visits…    1.68e9 /med… IMG_0058.jpg 
    #>  4  5438 news  "Balkans' biggest drug lords arres…    1.68e9 /med… 1.jpg        
    #>  5  5437 news  "Underworld ‘co-working’ space shu…    1.68e9 /med… image (2).png
    #>  6  5435 news  "4 arrests in a hit against clan-b…    1.68e9 /med… Teaser_bg_se…
    #>  7  5434 news  "International art trafficking sti…    1.68e9 /med… GREECE Icons…
    #>  8  5433 news  "132 ‘Ndrangheta mafia members arr…    1.68e9 /med… OpEureka.JPG 
    #>  9  5431 news  "288 dark web vendors arrested in …    1.68e9 /med… Spector16x9_…
    #> 10  5432 news  "9 arrested in crackdown on loverb…    1.68e9 /med… IMG-e612d123…
    #> 11  5429 news  "ICC and Europol conclude Working …    1.68e9 /med… 20230425_Eur…
    #> 12  5423 news  "90 victims of sexual exploitation…    1.68e9 /med… PR-25-04-202…
    #> # ℹ 2 more variables: mainImage$mimeType <chr>, $thumbs <list>
    
    # request the next 11 items (positions 13 to 23) by id; ids are listed in $NodeLoader$node$lists$ids
    req_ids <- paste0("ids[]=", server_data$NodeLoader$node$lists$ids[[1]][13:23], collapse = "&")
    req_ids
    #> [1] "ids[]=5422&ids[]=5421&ids[]=5420&ids[]=5419&ids[]=5418&ids[]=5416&ids[]=5414&ids[]=5412&ids[]=5411&ids[]=5408&ids[]=5407"
    jsonlite::fromJSON(paste0("https://www.europol.europa.eu/cms/api/items?", req_ids, "&type=node")) %>% 
      as_tibble()
    #> # A tibble: 11 × 6
    #>       id type  title                               published alias mainImage$alt
    #>    <int> <chr> <chr>                                   <int> <chr> <chr>        
    #>  1  5422 news  "50 arrested for trafficking hashi…    1.68e9 /med… 1632be7c-f07…
    #>  2  5421 news  "17 arrested in Spain in bust agai…    1.68e9 /med… ESGuardiaCiv…
    #>  3  5420 news  "19 arrests in Spanish-Swedish rai…    1.68e9 /med… cannabis_tea…
    #>  4  5419 news  "Further action against fraudulent…    1.68e9 /med… 240681152.jpg
    #>  5  5418 news  "Underground drug-money bank laund…    1.68e9 /med… GettyImages-…
    #>  6  5416 news  "Takedown of notorious hacker mark…    1.68e9 /med… Splash_page-…
    #>  7  5414 news  "New Modus Operandi: How organised…    1.68e9 /med… PortReportAn…
    #>  8  5412 news  "Europol supports dismantlement of…    1.68e9 /med… 14opengraph.…
    #>  9  5411 news  "22 firearms traffickers arrested …    1.68e9 /med… 2openg.jpg   
    #> 10  5408 news  "15 arrested in Brazil over 17 ton…    1.68e9 /med… HinterlandPR…
    #> 11  5407 news  "Gym doping bust: traffickers sell…    1.68e9 /med… WhatsApp Ima…
    #> # ℹ 2 more variables: mainImage$mimeType <chr>, $thumbs <list>
    

    Created on 2023-05-23 with reprex v2.0.2
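
    The request above fetches a single batch. To walk through more of the id list, the same query string can be built for successive batches. A minimal sketch, assuming the endpoint accepts batches of this size (not verified here); build_item_urls and its batch_size default are hypothetical names and values chosen for illustration, and no request is actually sent:

    ```r
    # Hypothetical helper: split a vector of ids into batches and build one
    # request URL per batch, in the same "ids[]=...&type=node" form used above.
    build_item_urls <- function(ids, batch_size = 10,
                                base = "https://www.europol.europa.eu/cms/api/items?") {
      # ceiling(seq_along(ids) / batch_size) gives the batch index of each id
      batches <- split(ids, ceiling(seq_along(ids) / batch_size))
      urls <- vapply(
        batches,
        function(b) paste0(base, paste0("ids[]=", b, collapse = "&"), "&type=node"),
        character(1)
      )
      unname(urls)
    }

    # A few ids taken from the output above, in batches of 2 (assumed size)
    urls <- build_item_urls(c(5422, 5421, 5420, 5419, 5418), batch_size = 2)
    urls
    ```

    Each URL could then be passed to jsonlite::fromJSON() as in the request above; a short pause between requests (for example Sys.sleep(1)) is a reasonable courtesy to the server.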

    I'd recommend getting more familiar with your browser's dev tools, especially the network tab, for discovering API endpoints. You may also want to disable JavaScript in your browser for the specific site to see which selectors would still work in rvest (JS can, and often will, heavily modify the DOM tree and therefore which selectors match). Alternatively, check the page source before building CSS selectors from the rendered page.
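
    The difference between raw source and rendered page can be reproduced offline. The sketch below uses a made-up HTML string (page_source and demo_data are illustrative names, not the real Europol markup): a selector aimed at rendered content finds nothing, while the data sits in a script tag and can be parsed the same way as in the answer's code:

    ```r
    library(rvest)

    # Made-up page: an empty container that JavaScript would fill in a browser,
    # plus the data embedded in a script tag.
    page_source <- paste0(
      "<html><head>",
      "<script>window.SERVER_DATA={\"items\":[{\"id\":1,\"title\":\"First\"}]};</script>",
      "</head><body><div id=\"app\"></div></body></html>"
    )
    html <- read_html(page_source)

    # A selector for rendered links matches nothing in the raw source
    html %>% html_elements("div#app a") %>% length()

    # The script tag, however, is present in the raw source
    raw_js <- html %>%
      html_elements(xpath = "//script[contains(text(),'window.SERVER_DATA')]") %>%
      html_text()

    # Strip the "window.SERVER_DATA=" prefix and trailing ";" and parse as JSON
    demo_data <- jsonlite::parse_json(
      sub("^window\\.SERVER_DATA=(.*);$", "\\1", raw_js),
      simplifyVector = TRUE
    )
    demo_data$items$title
    ```

    If a selector returns nothing on the object from read_html but matches in the browser's inspector, that is a strong hint the element is created by JavaScript after the page loads.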