My goal is to take the HTML and create a data frame with 10 rows (columns would work too), one for each item. I have used ChatGPT for help.
Say that I have this html (from https://www.history.com/shows/alone/cast/wyatt-black):
library(rvest)
contestant <- ' <p><strong>Here are the ten items Wyatt selected to bring on his survival journey to the bone-chilling temperatures of Northern Saskatchewan, Canada:</strong></p>
<p>1. Cooking Pot</p>
<p><p>2. Axe</p>
<p><p>3. Saw</p>
<p><p>4. Ferro Rod</p>
<p><p>5. Sleeping Bag</p>
<p><p>6. Snare Wire</p>
<p><p>7. Paracord</p>
<p><p>8. Fishing Line and Hooks</p>
<p><p>9. Bow and Arrows</p>
<p><p>10. Multitool</p>
<p>'
contestant_html <- read_html(contestant)
I can then scrape it using:
contestant_items <- html_nodes(contestant_html, xpath = '//p[starts-with(text(), "1.")]/following-sibling::p')
item_list <- html_text(contestant_items[1:10])
The result in item_list is:
item_list
[1] "" "2. Axe" "" "3. Saw" ""
[6] "4. Ferro Rod" "" "5. Sleeping Bag" "" "6. Snare Wire"
There are two issues. First, the first item is not included. Second, some of the returned elements are blank.
How can I improve the scraping code to handle these problems?
Relatedly, how should I handle a list that does not begin with numbers (from https://www.history.com/shows/alone/cast/brooke-and-dave-whipple)?
contestant2 <- '<p><strong>Here are the ten items Brooke and Dave selected to bring on their survival journey to Vancouver Island:</strong></p>
<ul>
<li>Bow saw</li>
<li>Pot – vintage aluminum coffee pot, 2 quarts</li>
<li>Tarp – 12′ x 12′</li>
<li>Bar of Soap</li>
<li>Rations</li>
<li>Ax – full-sized felling ax</li>
<li>Tarp – 12′ x 12′</li>
<li>Fishing line and hooks</li>
<li>Pan</li>
<li>Rations</li>
</ul>'
From these two examples, I think you may struggle to find a generalizable solution and will need to tailor the selectors to each page, but you can use the following:
library(rvest)
url1 <- "https://www.history.com/shows/alone/cast/wyatt-black"
url2 <- "https://www.history.com/shows/alone/cast/brooke-and-dave-whipple"
page <- read_html(url1)
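# Keep the odd-positioned <p> children from the 9th onward,
# which skips the empty paragraphs in between
page %>%
html_elements(xpath = "/html/body/div[1]/div[2]/div/div/article/p[position() >= 9 and position() mod 2 = 1]") %>%
html_text()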
# [1] "1. Cooking Pot" "2. Axe" "3. Saw" "4. Ferro Rod"
# [5] "5. Sleeping Bag" "6. Snare Wire" "7. Paracord" "8. Fishing Line and Hooks"
# [9] "9. Bow and Arrows" "10. Multitool"
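As for why the XPath in the question misbehaves: `following-sibling::p` only returns the siblings after the matched paragraph, never the "1." paragraph itself, and the blanks appear to come from the stray `<p><p>` pairs, which the parser turns into empty paragraphs. Below is a minimal sketch on an abbreviated copy of the question's HTML. The `sample_html` string and the `[normalize-space()]` filter are my own additions for illustration, and on the live page the sibling axis may also pick up paragraphs that come after the list.

```r
library(rvest)

# Abbreviated copy of the question's markup, including the stray <p><p> pairs
sample_html <- '<p><strong>Here are the ten items:</strong></p>
<p>1. Cooking Pot</p>
<p><p>2. Axe</p>
<p><p>3. Saw</p>
<p>'

doc <- read_html(sample_html)

# Union of the "1." paragraph itself and its non-empty following siblings
xp <- paste(
  '//p[starts-with(text(), "1.")]',
  '//p[starts-with(text(), "1.")]/following-sibling::p[normalize-space()]',
  sep = " | "
)

items <- html_text(html_elements(doc, xpath = xp), trim = TRUE)
items
```

The first half of the union brings back item 1, and `[normalize-space()]` drops any paragraph whose text is empty or whitespace only.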
page2 <- read_html(url2)
page2 %>%
html_elements(xpath = "/html/body/div[1]/div[2]/div/div/article/ul/li") %>%
html_text()
# [1] "Bow saw" "Pot – vintage aluminum coffee pot, 2 quarts"
# [3] "Tarp – 12′ x 12′" "Bar of Soap"
# [5] "Rations" "Ax – full-sized felling ax"
# [7] "Tarp – 12′ x 12′" "Fishing line and hooks"
# [9] "Pan" "Rations"
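If you want the data frame from the question and would rather not hard-code paragraph positions, one possible approach is to try `<li>` elements first and fall back to paragraphs that start with a number. This is only a sketch under assumptions: `extract_items()` is a hypothetical helper of mine, the two HTML strings are abbreviated copies of the question's snippets, and on a full page you would probably need to narrow the search to the article node first, since a bare `li` selector would also match navigation menus.

```r
library(rvest)

# Hypothetical helper (not part of rvest): returns one row per item
extract_items <- function(html) {
  doc <- read_html(html)

  # Case 1: the items are in a bulleted list
  items <- html_text(html_elements(doc, "li"), trim = TRUE)

  # Case 2: no <li> found, so fall back to paragraphs starting with "<digits>."
  if (length(items) == 0) {
    p_text <- html_text(html_elements(doc, "p"), trim = TRUE)
    numbered <- grepl("^[0-9]+\\.", p_text)
    # Strip the leading "1. ", "2. ", ... so only the item name remains
    items <- sub("^[0-9]+\\.[[:space:]]*", "", p_text[numbered])
  }

  data.frame(item = items, stringsAsFactors = FALSE)
}

# Abbreviated copies of the two snippets from the question
numbered_html <- '<p><strong>Here are the ten items:</strong></p>
<p>1. Cooking Pot</p>
<p><p>2. Axe</p>
<p><p>10. Multitool</p>
<p>'

bulleted_html <- '<ul>
<li>Bow saw</li>
<li>Bar of Soap</li>
</ul>'

df1 <- extract_items(numbered_html)
df2 <- extract_items(bulleted_html)
```

Filtering on the text pattern means the empty paragraphs and the bold heading are dropped without relying on their position, which should be a little less brittle than an absolute XPath if the page layout shifts.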