Using a regular expression when scraping with rvest


I'm trying to scrape some parliamentary speeches with rvest. Each link I have identifies a parliamentary session, but I need the text of individual MPs' speeches. I am using the general HTML class '.intervento' because it is consistent across the different URLs. This gives me a character vector containing several speeches per page. However, I am only interested in the elements that start with the name stored in another data frame column (name_surname) for that row. In some cases, multiple elements match the name_surname value, and I would like to keep them all.

Here is a minimal working example (MWE) with the code I have tried without success.

library(rvest)
library(stringr)

df <- structure(list(name_surname = c("FERDINANDO ADORNATO", "FERDINANDO ADORNATO", 
                                      "LUCIANO AGOSTINI"), text = c("http://documenti.camera.it/apps/commonServices/getDocumento.ashx?idlegislatura=16&sezione=assemblea&tipoDoc=stenografico&idSeduta=0019&nomefile=stenografico&ancora=sed0019.stenografico.tit00030.sub00060.int00020#sed0019.stenografico.tit00030.sub00060.int00020", 
                                                                    "http://documenti.camera.it/apps/commonServices/getDocumento.ashx?idlegislatura=16&sezione=assemblea&tipoDoc=stenografico&idSeduta=0019&nomefile=stenografico&ancora=sed0019.stenografico.tit00030.sub00060.int00120#sed0019.stenografico.tit00030.sub00060.int00120", 
                                                                    "http://documenti.camera.it/apps/commonServices/getDocumento.ashx?idlegislatura=16&sezione=assemblea&tipoDoc=stenografico&idSeduta=0041&nomefile=stenografico&ancora=sed0041.stenografico.tit00020.sub00010.int01320#sed0041.stenografico.tit00020.sub00010.int01320"
                                      )), row.names = c(NA, 3L), class = "data.frame")

df$text_html <- lapply(df$text, read_html)

# Code that returns all speeches on each page
df$text_final <- lapply(df$text_html, function(x) {
  html_nodes(x, '.intervento') %>% html_text(trim = TRUE)
})

# Attempt to select individual speeches
df$text_final2 <- lapply(df$text_html, function(x) {
  x <- html_nodes(x, '.intervento') %>% html_text(trim = TRUE)
  str_subset(x, paste0("^", df$name_surname))
})



Solution

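  • The pattern passed to str_subset should be a single name per page, but on every pass of lapply you are passing a vector of three names (i.e. the entire column df$name_surname). stringr is vectorised over both arguments, so it tries to pair speeches with names element by element instead of testing one name against every speech. I would use map2 from purrr to iterate over both columns in parallel, row by row: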

    df$text_final2 <- purrr::map2(df$text_html, df$name_surname, function(text, speaker) {
      x <- html_nodes(text, '.intervento') %>% html_text(trim = TRUE)
      str_subset(x, paste0("^", speaker))
    })
    

    Alternatively, you could use a row-wise mutate() from dplyr:

    library(dplyr)
    
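    df <- df |>
      rowwise() |>                     # evaluate mutate() one row at a time
      mutate(
        text_final3 = text_html |>
          html_nodes('.intervento') |>
          html_text(trim = TRUE) |>
          str_subset(paste0("^", name_surname)) |>
          list()                       # wrap in list() to keep a vector per row
      ) |>
      ungroup()                        # drop the row-wise grouping afterwards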
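
To see the difference without hitting the website, here is a minimal sketch on made-up data. The `speeches` list and the speech texts are hypothetical stand-ins for what `html_text()` would return on two pages; only the two names come from the question. It first shows the element-wise pairing that a vector of patterns produces, then the `map2` pairing of one page with one name.

```r
library(stringr)
library(purrr)

# Hypothetical stand-ins for the scraped '.intervento' text of two pages
speeches <- list(
  c("FERDINANDO ADORNATO. First speech",
    "PRESIDENTE. Procedural remark",
    "FERDINANDO ADORNATO. Second speech"),
  c("PRESIDENTE. Procedural remark",
    "LUCIANO AGOSTINI. Only speech")
)
speakers <- c("FERDINANDO ADORNATO", "LUCIANO AGOSTINI")

# With a vector of patterns, string i is only tested against pattern i.
# Both names occur below, but in the "wrong" positions, so nothing matches.
elementwise <- str_detect(
  c("LUCIANO AGOSTINI. Only speech", "FERDINANDO ADORNATO. First speech"),
  paste0("^", speakers)
)
elementwise  # FALSE FALSE

# map2 hands each page exactly one name, so every speech on the page
# is tested against that single pattern.
matched <- map2(speeches, speakers, function(page, speaker) {
  str_subset(page, paste0("^", speaker))
})
lengths(matched)  # 2 1
```

Both of the first speaker's speeches are kept, which covers the case where several elements match the same name.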