The code below returns the text of each `li` that follows an `h2` whose `id` contains `best`. The issue is that it returns results even when the `h2` `id` does not contain `best`. To debug it, look at the `li_texts` output: starting from "Best ChatGPT prompts for Sales", it gets repeated. The output should stop at "Best ChatGPT prompts for Games".

    library(rvest)

    url <- "https://xxxxxxxxxxx.com/blog/"
    html <- read_html(url)

    df <- html %>%
      html_nodes(xpath = "//h2[contains(@id,'best')]") %>%
      lapply(function(h2_node) {
        h2_text <- h2_node %>% html_text()
        li_nodes <- h2_node %>%
          html_nodes(xpath = "following-sibling::ol[1]/li")
        li_texts <- li_nodes %>% html_text()
        list(h2_text = h2_text, li_texts = li_texts)
      }) %>%
      dplyr::bind_rows()

TL;DR: Replace the `ol` in the XPath with `*`. We want the first element after the `h2`, whether it is an `ol` or a `ul` :)
Every section after "Best ChatGPT prompts for Games (Team collaboration)" uses a `ul` rather than an `ol` for its `li` items. Because `following-sibling::ol[1]` selects the nearest following `ol` no matter how far away it is, all of those headings resolve to the same list: the one in the "Additional resources for ChatGPT prompts" section.
Replace the `ol` with `*` so that the XPath takes the first element after the `h2`, whether it is an `ol` or a `ul`:

    library(rvest)

    url <- "https://xxxxxxxxx.com/blog/"
    html <- read_html(url)

    df <- html %>%
      html_nodes(xpath = "//h2[contains(@id,'best')]") %>%
      lapply(function(h2_node) {
        h2_text <- h2_node %>% html_text()
        # *[1] is the element right after the h2, whether it is <ol> or <ul>
        li_nodes <- h2_node %>%
          html_nodes(xpath = "following-sibling::*[1]/li")
        li_texts <- li_nodes %>% html_text()
        list(h2_text = h2_text, li_texts = li_texts)
      }) %>%
      dplyr::bind_rows()

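To see the difference without hitting the real site, here is a minimal sketch on a made-up HTML snippet. The headings, `id` values and list items are hypothetical stand-ins for the page structure described above, not the actual content of the blog.

```r
library(rvest)

# Hypothetical page: the first section uses <ol>, the second uses <ul>,
# and an unrelated <ol> sits under a later heading.
html <- read_html('
  <h2 id="best-sales">Sales</h2>
  <ol><li>s1</li><li>s2</li></ol>
  <h2 id="best-games">Games</h2>
  <ul><li>g1</li></ul>
  <h2 id="additional-resources">Resources</h2>
  <ol><li>r1</li></ol>
')

games <- html_node(html, xpath = "//h2[@id='best-games']")

# Original XPath: nearest following <ol> sibling, however far away it is.
with_ol <- games %>%
  html_nodes(xpath = "following-sibling::ol[1]/li") %>%
  html_text()

# Fixed XPath: only the element immediately after the <h2>.
with_star <- games %>%
  html_nodes(xpath = "following-sibling::*[1]/li") %>%
  html_text()

# Stricter variant: same element, but only if it is an <ol> or a <ul>.
with_strict <- games %>%
  html_nodes(xpath = "following-sibling::*[1][self::ol or self::ul]/li") %>%
  html_text()

print(with_ol)      # "r1" (taken from the Resources section)
print(with_star)    # "g1"
print(with_strict)  # "g1"
```

Note that `*[1]` returns nothing when the element right after the `h2` is not a list (a `p`, for example), which is usually preferable to silently picking up items from a later section.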
PS. I'm not a fan of this page's DOM; nesting each section in its own `div` would make it easier to scrape. The website accepts all kinds of traffic, and I treat it as an educational case study.