r, web-scraping, rvest

How to scrape a list of items into a data frame with rvest


My goal is to take the HTML and create a data frame with 10 rows (columns would work too), one per item. I have used ChatGPT for help.

Say that I have this HTML (from https://www.history.com/shows/alone/cast/wyatt-black):

library(rvest)
contestant <- '    <p><strong>Here are the ten items Wyatt selected to bring on his survival journey to the bone-chilling temperatures of Northern Saskatchewan, Canada:</strong></p>
    <p>1. Cooking Pot</p>
    <p><p>2. Axe</p>
    <p><p>3. Saw</p>
    <p><p>4. Ferro Rod</p>
    <p><p>5. Sleeping Bag</p>
    <p><p>6. Snare Wire</p>
    <p><p>7. Paracord</p>
    <p><p>8. Fishing Line and Hooks</p>
    <p><p>9. Bow and Arrows</p>
    <p><p>10. Multitool</p>
    <p>'

contestant_html <- read_html(contestant)

I can then scrape it using:

contestant_items <- html_nodes(contestant_html, xpath = '//p[starts-with(text(), "1.")]/following-sibling::p')
item_list <- html_text(contestant_items[1:10])

Contained in item_list is:

item_list
 [1] ""                "2. Axe"          ""                "3. Saw"          ""               
 [6] "4. Ferro Rod"    ""                "5. Sleeping Bag" ""                "6. Snare Wire" 

There are two issues: the first item is missing, and some of the returned entries are blank.

How can I improve the scraping code to handle these problems?

Relatedly, how should I handle a page where the list items are not numbered (from https://www.history.com/shows/alone/cast/brooke-and-dave-whipple)?

contestant2 <- '<p><strong>Here are the ten items Brooke and Dave selected to bring on their survival journey to Vancouver Island:</strong></p>
<ul>
<li>Bow saw</li>
<li>Pot &#8211; vintage aluminum coffee pot, 2 quarts</li>
<li>Tarp &#8211; 12&#8242; x 12&#8242;</li>
<li>Bar of Soap</li>
<li>Rations</li>
<li>Ax &#8211; full-sized felling ax</li>
<li>Tarp &#8211; 12&#8242; x 12&#8242;</li>
<li>Fishing line and hooks</li>
<li>Pan</li>
<li>Rations</li>
</ul>'

Solution

  • Judging from these two examples, I think you may struggle to find a generalizable solution and will need to tailor the selectors to each page. For these two pages, you can use the following:

    library(rvest)
    
    url1 <- "https://www.history.com/shows/alone/cast/wyatt-black"
    url2 <- "https://www.history.com/shows/alone/cast/brooke-and-dave-whipple"
    
    page <- read_html(url1)
    
    # Every odd-positioned <p> child of the article, starting from the 9th
    page %>%
      html_elements(xpath = "/html/body/div[1]/div[2]/div/div/article/p[position() >= 9 and position() mod 2 = 1]") %>%
      html_text()
    
    # [1] "1. Cooking Pot"            "2. Axe"                    "3. Saw"                    "4. Ferro Rod"             
    # [5] "5. Sleeping Bag"           "6. Snare Wire"             "7. Paracord"               "8. Fishing Line and Hooks"
    # [9] "9. Bow and Arrows"         "10. Multitool"  
    
    page2 <- read_html(url2)
    
    page2 %>%
      html_elements(xpath = "/html/body/div[1]/div[2]/div/div/article/ul/li") %>%
      html_text()
    
    # [1] "Bow saw"                                     "Pot – vintage aluminum coffee pot, 2 quarts"
    # [3] "Tarp – 12′ x 12′"                            "Bar of Soap"                                
    # [5] "Rations"                                     "Ax – full-sized felling ax"                 
    # [7] "Tarp – 12′ x 12′"                            "Fishing line and hooks"
    # [9] "Pan"                                         "Rations"
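
Since the goal was a data frame, one possible follow-up is sketched below. In the original attempt, `following-sibling::p` can never return the "1." paragraph itself, and the blank entries appear to come from each `<p><p>` pair being parsed as an empty paragraph followed by the real one. The sketch selects every `<p>` without a `<strong>` heading plus every `<li>`, drops the empty strings, and strips any leading number. `extract_items()` is a hypothetical helper, the two sample strings are shortened copies of the snippets from the question, and none of this has been checked against the live pages.

```r
library(rvest)

# Hypothetical helper: parse an HTML string and return one row per item.
# Assumes items are either <p> elements without a <strong> heading
# or <li> elements; other page layouts would need a different selector.
extract_items <- function(html) {
  doc <- read_html(html)
  nodes <- html_elements(doc, xpath = "//p[not(strong)] | //li")
  txt <- trimws(html_text(nodes))
  txt <- txt[nzchar(txt)]                        # drop the empty <p> nodes
  txt <- sub("^[0-9]+\\.[[:space:]]*", "", txt)  # strip a leading "1. " prefix, if any
  data.frame(item = txt, stringsAsFactors = FALSE)
}

# Shortened copies of the two snippets from the question
numbered <- '<p><strong>Here are the ten items:</strong></p>
<p>1. Cooking Pot</p>
<p><p>2. Axe</p>
<p><p>3. Saw</p>
<p>'

bulleted <- '<p><strong>Here are the ten items:</strong></p>
<ul>
<li>Bow saw</li>
<li>Bar of Soap</li>
</ul>'

items1 <- extract_items(numbered)
items2 <- extract_items(bulleted)
```

Calling `extract_items(contestant)` or `extract_items(contestant2)` on the full strings from the question should give one row per item in the same way. On the live pages the selector would likely pick up unrelated paragraphs, so it would still need to be narrowed to the article body as in the answer above.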