Using a regular expression when scraping with rvest

I'm trying to scrape some parliamentary speeches with rvest. Each link I have identifies a parliamentary session, but I need the text of individual MPs' speeches. I am using the general HTML class '.intervento' because it is consistent across the different URLs. That gives me a character vector containing several speeches per page. However, I am only interested in the elements that start with the name stored in another data frame column (name_surname). In some cases, multiple elements match the name in name_surname, and I would like to keep them all.

This is the MWE with the code I have been unsuccessfully using.

library(rvest)
library(stringr)

df <- structure(list(name_surname = c("FERDINANDO ADORNATO", "FERDINANDO ADORNATO", 
                                      "LUCIANO AGOSTINI"), text = c("http://documenti.camera.it/apps/commonServices/getDocumento.ashx?idlegislatura=16&sezione=assemblea&tipoDoc=stenografico&idSeduta=0019&nomefile=stenografico&ancora=sed0019.stenografico.tit00030.sub00060.int00020#sed0019.stenografico.tit00030.sub00060.int00020", 
                                                                    "http://documenti.camera.it/apps/commonServices/getDocumento.ashx?idlegislatura=16&sezione=assemblea&tipoDoc=stenografico&idSeduta=0019&nomefile=stenografico&ancora=sed0019.stenografico.tit00030.sub00060.int00120#sed0019.stenografico.tit00030.sub00060.int00120", 
                                                                    "http://documenti.camera.it/apps/commonServices/getDocumento.ashx?idlegislatura=16&sezione=assemblea&tipoDoc=stenografico&idSeduta=0041&nomefile=stenografico&ancora=sed0041.stenografico.tit00020.sub00010.int01320#sed0041.stenografico.tit00020.sub00010.int01320"
                                      )), row.names = c(NA, 3L), class = "data.frame")

df$text_html <- lapply(df$text, read_html)

# Code that returns all speeches
df$text_final <- lapply(df$text_html, function(x) {
  interventi <- html_nodes(x, '.intervento') %>% html_text(trim = TRUE)
})

# Attempt to select individual speeches
df$text_final2 <- lapply(df$text_html, function(x) {
  x <- html_nodes(x, '.intervento') %>% html_text(trim = TRUE)
  str_subset(x, paste0("^", df$name_surname))
})

Solution

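str_subset() is vectorised over both its string and pattern arguments, so it does not test every speech against every name. On each pass of lapply() you are passing it all three names (i.e. the entire column df$name_surname), and the patterns get paired with the speeches position by position instead of each page being filtered by its own speaker. What you need is one name per page, so I would use map2() to iterate through both columns row by row: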

df$text_final2 <- purrr::map2(df$text_html, df$name_surname, function(text, speaker) {
  x <- html_nodes(text, '.intervento') %>% html_text(trim = TRUE)
  str_subset(x, paste0("^", speaker))
})
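To see the difference without hitting the website, here is a minimal sketch on made-up data. The speech openings, `pages`, and `names_col` are invented stand-ins for the scraped '.intervento' text and for df$name_surname; only the stringr and purrr calls are real.

```r
library(purrr)
library(stringr)

# Made-up speech openings standing in for one scraped page
speeches <- c(
  "LUCIANO AGOSTINI. Grazie.",
  "FERDINANDO ADORNATO. Signor Presidente.",
  "PRESIDENTE. Prego."
)

# Stand-in for df$name_surname
names_col <- c("FERDINANDO ADORNATO", "FERDINANDO ADORNATO", "LUCIANO AGOSTINI")

# Whole column as the pattern: pattern i is compared only with speech i
pairwise <- str_detect(speeches, paste0("^", names_col))

# A single name as the pattern: every speech is tested against that name
single <- str_detect(speeches, paste0("^", names_col[3]))

# map2() supplies exactly one name per page
# (here the same made-up page is reused three times)
pages <- list(speeches, speeches, speeches)
matched <- map2(pages, names_col, function(page, speaker) {
  str_subset(page, paste0("^", speaker))
})
```

In this toy case `pairwise` flags only the second speech, even though LUCIANO AGOSTINI opens the first one, while `matched` holds the right speech for each row.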

Alternatively, you could use a rowwise mutate(), wrapping the result in list() so that it is stored as a list-column:

library(dplyr)

df <- df |> 
  rowwise() |> 
  mutate(
    text_final3 = text_html |> 
      html_nodes('.intervento') |>  
      html_text(trim = TRUE) |> 
      str_subset(paste0("^", name_surname)) |> 
      list()
  )
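One caveat with both approaches: paste0("^", name) treats the name as a regular expression, so a name containing characters such as parentheses or dots would not be matched literally. I don't know whether that occurs in your data, so treat the following as a sketch under that assumption. The speaker string and speeches are hypothetical. It swaps the regex for a literal prefix test with str_starts() and fixed().

```r
library(stringr)

# Hypothetical speeches and a hypothetical name containing regex metacharacters
speeches <- c(
  "MARIO ROSSI (PD). Grazie.",
  "MARIO ROSSI. Signor Presidente.",
  "PRESIDENTE. Prego."
)
speaker <- "MARIO ROSSI (PD)"

# As a regex, the parentheses form a group, so this looks for "MARIO ROSSI PD"
regex_hits <- str_subset(speeches, paste0("^", speaker))

# As a fixed string, the name is compared literally against the start of each speech
literal_hits <- speeches[str_starts(speeches, fixed(speaker))]
```

Here `regex_hits` is empty, while `literal_hits` keeps the first speech. If your names are plain letters and spaces, the regex version above is fine as it is.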