I'm trying to scrape some parliamentary speeches with rvest. The links I have identify a parliamentary session, but I need to scrape the text of individual MPs' speeches.
I am using the generic CSS class '.intervento' because it is consistent across the different URLs. With that I get a character vector containing several speeches. However, I am only interested in the elements that start with the name stored in another data frame column (name_surname). In some cases, multiple elements match the name_surname value, and I would like to keep them all.
This is the MWE with the code I have been unsuccessfully using.
library(rvest)
library(stringr)
df <- structure(list(name_surname = c("FERDINANDO ADORNATO", "FERDINANDO ADORNATO",
"LUCIANO AGOSTINI"), text = c("http://documenti.camera.it/apps/commonServices/getDocumento.ashx?idlegislatura=16&sezione=assemblea&tipoDoc=stenografico&idSeduta=0019&nomefile=stenografico&ancora=sed0019.stenografico.tit00030.sub00060.int00020#sed0019.stenografico.tit00030.sub00060.int00020",
"http://documenti.camera.it/apps/commonServices/getDocumento.ashx?idlegislatura=16&sezione=assemblea&tipoDoc=stenografico&idSeduta=0019&nomefile=stenografico&ancora=sed0019.stenografico.tit00030.sub00060.int00120#sed0019.stenografico.tit00030.sub00060.int00120",
"http://documenti.camera.it/apps/commonServices/getDocumento.ashx?idlegislatura=16&sezione=assemblea&tipoDoc=stenografico&idSeduta=0041&nomefile=stenografico&ancora=sed0041.stenografico.tit00020.sub00010.int01320#sed0041.stenografico.tit00020.sub00010.int01320"
)), row.names = c(NA, 3L), class = "data.frame")
df$text_html <- lapply(df$text, read_html)
# Code that returns all speeches
df$text_final <- lapply(df$text_html, function(x) {
  interventi <- html_nodes(x, '.intervento') %>% html_text(trim = TRUE)
})
# Attempt to select individual speeches
df$text_final2 <- lapply(df$text_html, function(x) {
  x <- html_nodes(x, '.intervento') %>% html_text(trim = TRUE)
  str_subset(x, paste0("^", df$name_surname))
})
str_subset should receive a single pattern here, but on each pass of lapply you are passing it a vector of all three names (i.e. the entire column df$name_surname). I would use map2 to iterate through both columns row by row:
df$text_final2 <- purrr::map2(df$text_html, df$name_surname, function(text, speaker) {
  x <- html_nodes(text, '.intervento') %>% html_text(trim = TRUE)
  str_subset(x, paste0("^", speaker))
})
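To see the row-by-row pairing without hitting the network, here is a minimal sketch on made-up data. The speeches and speakers objects are hypothetical stand-ins for the scraped '.intervento' text and the name_surname column; they are not taken from the actual pages.

```r
library(purrr)
library(stringr)

# Hypothetical stand-ins for the scraped text: one character vector per row
speeches <- list(
  c("FERDINANDO ADORNATO. First speech.", "LUCIANO AGOSTINI. Another speech."),
  c("LUCIANO AGOSTINI. A speech.", "LUCIANO AGOSTINI. A second speech.")
)
# Hypothetical stand-in for df$name_surname, one name per row
speakers <- c("FERDINANDO ADORNATO", "LUCIANO AGOSTINI")

# map2() hands the function one element of each input at a time,
# so each vector of speeches is filtered by the name from the same row
matched <- map2(speeches, speakers, function(x, speaker) {
  str_subset(x, paste0("^", speaker))
})

# Number of speeches kept per row; all matches are retained
lengths(matched)
```

The result is a list with one character vector per row, so rows with several matching speeches keep all of them.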
Alternatively, you could use a rowwise mutate:
library(dplyr)
df <- df |>
  rowwise() |>
  mutate(
    text_final3 = text_html |>
      html_nodes('.intervento') |>
      html_text(trim = TRUE) |>
      str_subset(paste0("^", name_surname)) |>
      list()
  )
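One caveat with both approaches: paste0("^", ...) treats the name as a regular expression, so a name containing characters such as parentheses or dots may not match as intended. Below is a minimal sketch of a literal prefix match with base R's startsWith(), assuming made-up strings (x and speaker are hypothetical, not taken from the real pages).

```r
library(stringr)

# Hypothetical speech texts and a speaker name containing regex metacharacters
x <- c("MARIO ROSSI (PD). Speech.", "MARIO ROSSI. Speech.", "ANNA BIANCHI. Speech.")
speaker <- "MARIO ROSSI (PD)"

# Regex version: the parentheses are read as a group, so nothing matches
regex_hits <- str_subset(x, paste0("^", speaker))

# Literal version: startsWith() compares the prefix character by character
literal_hits <- x[startsWith(x, speaker)]

length(regex_hits)
literal_hits
```

If the names in name_surname are plain letters and spaces, as in the example data, the regex version behaves the same and no change is needed.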