
Scraping with {rvest} yields "character (empty)"?


I've been scraping a file from a website, but the page has moved to a new URL. I changed the URL and the CSS selector, but my link object no longer contains a path to the file; it shows up as "character (empty)". What seems to be the problem?

Site: https://arbetsformedlingen.se/statistik/statistik-om-varsel

I want to grab the file "Tillfällig statistik per län och bransch, januari-april 2023" in the 'box' Antal varsel och berörda personer.

R-code:

library(tidyverse)
library(stringr)
library(rio) #import-function
library(rvest) #read_html()-function


# Link to target site
url <- "https://arbetsformedlingen.se/statistik/statistik-om-varsel"

## Parse the HTML content
doc <- read_html(url)

## Find the data you want to scrape
### Select CSS locator
link <- html_elements(doc, css = '#cardContainer > app-downloads:nth-child(3) > div > div:nth-child(3) > div > digi-link-internal > digi-link > a') %>%
  html_attr("href")

# Create URL for file download
url2 <- "https://arbetsformedlingen.se"
full_link <- sprintf("%s%s", url2, link)

# Get and save file locally
td <- tempdir()             # create a temporary directory
varsel_fil <- tempfile(tmpdir=td, fileext = ".xlsx")
download.file(full_link, destfile = varsel_fil, mode = "wb")   

# Read file into a df
df_imported <- import(varsel_fil, which = 1) # which: select the sheet number

Previously, the css argument to html_elements() was #svid12_142311c317a09e842af1a94 > div.sv-text-portlet-content > p:nth-child(20) > strong > a

So the beginning of the selector is quite different now, but I don't understand what that implies.

Thanks for any assistance!


Solution

  • That page is now rendered mostly by JavaScript, and most of its content is not included in the page source; you can check this by disabling JavaScript for the site in your browser. The list of files in the described box is pulled from https://arbetsformedlingen.se/rest/analysportalen/va/sitevision.
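
    The empty result in the question is what rvest returns when the selector matches nothing in the HTML it was given. Below is a minimal offline sketch of that situation. The markup is a made-up stand-in for the page source, not the real one: it assumes the source contains only an empty placeholder element that JavaScript would fill in later.

    ```r
    library(rvest)

    # Hypothetical stand-in for what read_html() receives from a
    # JavaScript-rendered page: the container exists, the links do not.
    static_source <- minimal_html('
      <div id="cardContainer">
        <app-downloads></app-downloads>
      </div>
    ')

    # Same idea as in the question: select the <a> elements, take their href.
    nodes <- html_elements(static_source, css = "#cardContainer a")
    link  <- html_attr(nodes, "href")

    # No <a> exists in the source, so the result is a zero-length character
    # vector, which RStudio's environment pane shows as "character (empty)".
    length(link)

    # A zero-length argument makes sprintf() return a zero-length result too.
    full_link <- sprintf("%s%s", "https://arbetsformedlingen.se", link)
    length(full_link)
    ```

    Since full_link is also zero-length here, download.file() has no URL to fetch, whatever the selector looks like.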

    A quick way to find this API endpoint is the network tab of the browser's developer tools: after opening the dev tools, refresh the page to capture all requests, then search the responses for a phrase that cannot be found in the source of the main page, e.g. "januari-april". Once the API endpoint with the file list is identified, we can extract the file URL and proceed with the download:

    library(dplyr)
    url_ <- "https://arbetsformedlingen.se/rest/analysportalen/va/sitevision"
    xlsx_response <- jsonlite::fromJSON(url_, simplifyVector = FALSE) %>% 
      # there are 3 files listed, naively picking the last one; 
      # this may or may not keep working in the long run
      dplyr::last() %>% 
      # we could also use purrr::keep() **before** last() to keep only 
      # the record with matching name, like 
      # purrr::keep(~ .x$name == "Länktext till varsel tillfällig statistik")
      purrr::pluck("properties", "link") %>% 
      # combine first part of url_ and extracted link to get full URL for the file
      # last parameter, ".", is where the output of previous pipe ends up 
      # next expression evaluates as:
      # str_replace("https://arbetsformedlingen.se/rest/analysportalen/va/sitevision", "(?<=\\w)/.*", "/download/18.793fa1821869801540c14b3/1683719045927/web-varsel-bransch-lan-2023.xlsx")
      stringr::str_replace(url_, "(?<=\\w)/.*", .) %>%
      httr2::request() %>% 
      # we can store the file by setting the path
      httr2::req_perform(path = file.path(tempdir(), basename(.$url)))
    
    # httr2 response:
    xlsx_response
    #> <httr2_response>
    #> GET
    #> https://arbetsformedlingen.se/download/18.793fa1821869801540c14b3/1683719045927/web-varsel-bransch-lan-2023.xlsx
    #> Status: 200 OK
    #> Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    #> Body: On disk 'body'
    
    # open file location:
    # browseURL(xlsx_response$body[1])
    

    Downloaded file:

    readxl::read_xlsx(xlsx_response$body[1], 1, "A4:Y26") %>% glimpse()
    #> New names:
    #> • `` -> `...1`
    #> • `` -> `...2`
    #> • `` -> `...25`
    #> Rows: 22
    #> Columns: 25
    #> $ ...1  <chr> "SNI-kod", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"…
    #> $ ...2  <chr> "Näringsgren", "Jordbruk, skogsbruk och fiske", "Utvinning av mi…
    #> $ AB    <chr> "Stockholms län", "5", NA, "56", NA, NA, "256", "384", "141", "3…
    #> $ C     <chr> "Uppsala län", NA, NA, NA, NA, NA, "18", NA, NA, NA, NA, NA, NA,…
    #> $ D     <chr> "Södermanlands län", NA, NA, NA, NA, "8", NA, NA, NA, NA, NA, NA…
    #> $ E     <chr> "Östergötlands län", NA, NA, "20", NA, NA, "61", NA, "17", NA, N…
    #> $ F     <chr> "Jönköpings län", NA, "5", "246", NA, NA, NA, "21", NA, NA, NA, …
    #> $ G     <chr> "Kronobergs län", NA, NA, "201", NA, NA, "19", NA, NA, NA, NA, N…
    #> $ H     <chr> "Kalmar län", NA, NA, NA, NA, NA, NA, NA, "15", NA, NA, NA, NA, …
    #> $ I     <chr> "Gotlands län", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
    #> $ K     <chr> "Blekinge län", NA, NA, "9", NA, NA, NA, "18", NA, NA, NA, NA, N…
    #> $ M     <chr> "Skåne län", NA, NA, "30", NA, NA, "57", "63", NA, NA, "46", NA,…
    #> $ N     <chr> "Hallands län", NA, NA, "24", NA, NA, "15", "6", NA, NA, NA, NA,…
    #> $ O     <chr> "Västra Götalands län", NA, NA, "62", NA, NA, "120", "27", "96",…
    #> $ S     <chr> "Värmlands län", NA, NA, "20", NA, NA, NA, NA, NA, "10", NA, NA,…
    #> $ T     <chr> "Örebro län", NA, NA, "47", NA, NA, NA, "8", NA, "18", NA, NA, "…
    #> $ U     <chr> "Västmanlands län", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
    #> $ W     <chr> "Dalarnas län", NA, NA, NA, NA, NA, "15", NA, NA, NA, NA, NA, NA…
    #> $ X     <chr> "Gävleborgs län", NA, NA, "32", NA, NA, NA, NA, NA, "50", NA, NA…
    #> $ Y     <chr> "Västernorrlands län", NA, NA, "10", NA, NA, NA, NA, NA, NA, NA,…
    #> $ Z     <chr> "Jämtlands län", NA, NA, "14", NA, NA, NA, NA, NA, NA, "6", NA, …
    #> $ AC    <chr> "Västerbottens län", NA, NA, "50", NA, NA, "9", NA, "16", NA, NA…
    #> $ BD    <chr> "Norrbottens län", NA, NA, "6", NA, NA, "15", "6", NA, NA, NA, N…
    #> $ `-`   <chr> "Uppgift saknas", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
    #> $ ...25 <chr> "Riket", "5", "5", "827", NA, "8", "585", "533", "285", "125", "…
    

    Created on 2023-05-20 with reprex v2.0.2

    A more base-R-like approach would perhaps be:

    download.file(paste0("https://arbetsformedlingen.se",
                         jsonlite::read_json("https://arbetsformedlingen.se/rest/analysportalen/va/sitevision")[[3]]$properties$link),
                  file.path(tempdir(),"out.xlsx"), mode = "wb")
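
    Both variants above pick the record by position (the last one, or [[3]]), which breaks if the list is reordered. The name-based selection mentioned in the comments can be sketched offline as follows. The files list is a hypothetical, abbreviated stand-in for the parsed JSON: only the name and properties$link fields and the one record name quoted earlier come from the answer, while the other names and all link values are placeholders. Using xml2::url_absolute() to build the full URL is a swapped-in alternative to the regex replacement, not what the code above does.

    ```r
    # Hypothetical stand-in for jsonlite::fromJSON(url_, simplifyVector = FALSE);
    # placeholder names and links, except the record name quoted in the answer.
    files <- list(
      list(name = "placeholder record A",
           properties = list(link = "/download/placeholder-a.xlsx")),
      list(name = "Länktext till varsel tillfällig statistik",
           properties = list(link = "/download/placeholder-b.xlsx")),
      list(name = "placeholder record C",
           properties = list(link = "/download/placeholder-c.xlsx"))
    )

    wanted <- "Länktext till varsel tillfällig statistik"

    # Keep only the records whose name matches, then take the link of the first.
    matches  <- purrr::keep(files, ~ .x$name == wanted)
    rel_link <- purrr::pluck(matches, 1, "properties", "link")

    # pluck() returns NULL when nothing matched; fail early instead of
    # passing an empty value on to the download step.
    if (is.null(rel_link)) stop("No record named '", wanted, "' in the file list")

    # Resolve the site-relative link against the endpoint URL.
    base_url  <- "https://arbetsformedlingen.se/rest/analysportalen/va/sitevision"
    full_link <- xml2::url_absolute(rel_link, base_url)
    full_link
    ```

    With the real response in place of files, full_link could then be passed to download.file() or httr2::request() as before.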