Tags: html, href, rvest

How to get an html_nodes object in rvest


I am having trouble understanding where and how to get the "href" values. I found the tutorial "Harvesting the web with rvest", which is very helpful, but I could not work out where the selector '.cast_list .character' comes from on the website. I believe the website structure has changed since the tutorial was written, but the idea is to get the list of href values, such as

"/title/tt1490017/characters/nm0004715?ref_=tt_cl_t1", and then turn each one into an absolute URL, such as "https://www.imdb.com/title/tt1490017/characters/nm0004715?ref_=tt_cl_t1". This is the code from the tutorial:

  html_nodes(html, ".cast_list .character") %>% 
  html_children() %>% 
  html_attr("href")

Solution

  • The HTML has changed, but the way the hrefs are constructed has not. I would therefore switch to an attribute selector with the contains operator (*=), which targets the required links by matching any href whose value contains the string "character". Compared with the tutorial, this has two additional benefits: 1) no duplicate return values, and 2) it is more robust over time.

    You could start the selector with a node match that narrows the search space in the DOM further, e.g. .title-cast__grid [href*=character], but this is not currently necessary.

    To read more about CSS selectors and operators, I recommend the MDN developer pages, e.g. https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors


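    # The native pipe |> requires R >= 4.1; use %>% on older versions
    library(rvest)

    url <- "https://www.imdb.com/title/tt1490017/"

    # Select every element whose href attribute contains "character"
    characters <- read_html(url) |>
      html_elements("[href*=character]")

    # Pair each character's text with its absolute link
    data.frame(
      character = characters |> html_text2(),
      link = characters |> html_attr("href") |> url_absolute(url)
    )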
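
To see how the attribute selector behaves without depending on the live site, here is a minimal offline sketch. The HTML fragment is made up for illustration and is not IMDb's actual markup: the second character id (nm0000000) and the link texts are hypothetical, and only the class name .title-cast__grid and the first href come from the discussion above. It compares the unscoped selector with the scoped one and then resolves the relative href against the page URL.

```r
library(rvest)

# Hypothetical fragment for illustration only (not IMDb's real markup)
page <- minimal_html('
  <div class="title-cast__grid">
    <a href="/title/tt1490017/characters/nm0004715?ref_=tt_cl_t1">Character A</a>
    <a href="/name/nm0004715/">Actor page</a>
  </div>
  <a href="/title/tt1490017/characters/nm0000000">Outside the grid</a>
')

# Unscoped: every element whose href contains "character"
all_links <- page |> html_elements("[href*=character]")

# Scoped: only matches that are descendants of .title-cast__grid
scoped <- page |> html_elements(".title-cast__grid [href*=character]")

# Resolve the relative href against the page URL
base <- "https://www.imdb.com/title/tt1490017/"
links <- scoped |> html_attr("href") |> url_absolute(base)

length(all_links)  # 2
length(scoped)     # 1
links
```

In this fragment the link to "/name/nm0004715/" is skipped by both selectors because its href does not contain "character". The scoped version also drops the match outside the grid, which is the narrowing effect described above.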