web-scraping xpath rvest

XPath issue with rvest


The code below returns the text of the li items that follow each h2 whose id contains "best". The issue is that it also returns li text from places where the h2 id does not contain "best". To debug it, look at li_texts in the output: from "Best ChatGPT prompts for Sales" onward, the same items are repeated. The output should stop at "Best ChatGPT prompts for Games".

library(rvest)
url <- "https://xxxxxxxxxxx.com/blog/"
html <- read_html(url)

df <- html %>%
  html_nodes(xpath = "//h2[contains(@id,'best')]") %>%
  lapply(function(h2_node) {
    h2_text <- h2_node %>% html_text()
    li_nodes <- h2_node %>% 
      html_nodes(xpath = "following-sibling::ol[1]/li")
    li_texts <- li_nodes %>% html_text()
    list(h2_text = h2_text, li_texts = li_texts)
  }) %>% dplyr::bind_rows()

Solution

  • TL;DR: Replace ol with * in the XPath, so that the first element after each h2 is selected whether it is an ol or a ul :)

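    Every section after "Best ChatGPT prompts for Games (Team collaboration)" puts its li items in a ul, not an ol. For those headings, following-sibling::ol[1] skips the ul and matches the next ol further down the page, which is the same one for all of them: the list in the "Additional resources for ChatGPT prompts" section. That is why the same items are repeated.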
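
    The effect can be reproduced on a small made-up page. This is only a sketch: the HTML, the ids, and the variable names below are hypothetical and do not come from the real blog; minimal_html() is rvest's helper for parsing an HTML string.

    ```r
    library(rvest)

    # Hypothetical page: section A uses <ol>, sections B and C use <ul>,
    # and an unrelated <ol> sits at the end of the page.
    page <- minimal_html('
      <h2 id="best-a">A</h2><ol><li>a1</li><li>a2</li></ol>
      <h2 id="best-b">B</h2><ul><li>b1</li></ul>
      <h2 id="best-c">C</h2><ul><li>c1</li></ul>
      <h2 id="additional">Additional resources</h2><ol><li>r1</li><li>r2</li></ol>
    ')

    h2_nodes <- html_nodes(page, xpath = "//h2[contains(@id,'best')]")

    # Same XPath as in the question: the first <ol> sibling after each h2.
    first_ol_items <- lapply(h2_nodes, function(h2_node) {
      h2_node %>%
        html_nodes(xpath = "following-sibling::ol[1]/li") %>%
        html_text()
    })
    names(first_ol_items) <- html_text(h2_nodes)

    # A gets its own items; B and C both get the items of the last <ol>.
    first_ol_items
    ```

    In this toy page, B and C have no ol of their own, so both end up with the items of the trailing "Additional resources" list.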

    So swap ol for * in the XPath of the inner html_nodes() call. following-sibling::*[1] matches the first element after the h2, whatever its tag:

    library(rvest)
    url <- "https://xxxxxxxxx.com/blog/"
    html <- read_html(url)
    
    df <- html %>%
      html_nodes(xpath = "//h2[contains(@id,'best')]") %>%
      lapply(function(h2_node) {
        h2_text <- h2_node %>% html_text()
        li_nodes <- h2_node %>% 
          html_nodes(xpath = "following-sibling::*[1]/li")
        li_texts <- li_nodes %>% html_text()
        list(h2_text = h2_text, li_texts = li_texts)
      }) %>% dplyr::bind_rows()
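
    To check what following-sibling::*[1]/li does, here is a small sketch on a made-up page (the HTML, the ids, and the items_after() helper are hypothetical, not taken from the real blog). It also shows one limitation I would expect: if some other element, such as a p, sits between the h2 and its list, the first sibling has no li children and nothing is returned for that heading.

    ```r
    library(rvest)

    # Hypothetical page: A uses <ol>, B uses <ul>, and C has a paragraph
    # between the heading and its list.
    page <- minimal_html('
      <h2 id="best-a">A</h2><ol><li>a1</li><li>a2</li></ol>
      <h2 id="best-b">B</h2><ul><li>b1</li></ul>
      <h2 id="best-c">C</h2><p>intro</p><ul><li>c1</li></ul>
      <h2 id="additional">Additional resources</h2><ol><li>r1</li></ol>
    ')

    h2_nodes <- html_nodes(page, xpath = "//h2[contains(@id,'best')]")

    # Hypothetical helper: li text of the first element right after the h2.
    items_after <- function(h2_node) {
      h2_node %>%
        html_nodes(xpath = "following-sibling::*[1]/li") %>%
        html_text()
    }

    first_sibling_items <- lapply(h2_nodes, items_after)
    names(first_sibling_items) <- html_text(h2_nodes)

    # A and B get their own items; C is empty because its first sibling is <p>.
    first_sibling_items
    ```

    On the blog in question every list seems to follow its h2 directly, so the empty case should not occur there, but it is worth checking on other pages.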
    

    PS. I don't much like the DOM of this webpage; nesting each section in its own div would make it easier to scrape. The website accepts all kinds of traffic, and I treat it as an educational case study.