
Rvest html_nodes span div other items


I'm scraping this HTML and I want to extract the text inside the <span data-testid="distance"> element:

<span class="class1">
<span data-testid="distance">the text i want</span>
</span>
<span class="class2">
<span class="class1"><span>the other text i'm obtaining</span>
</span>

distancia <- hoteles_verdes %>% 
  html_elements("span.class1") %>%
  html_text()

My question is how to select only the element with data-testid="distance" in html_elements(), so that I can then retrieve its text with html_text().

This is my first question here. Thanks!


Solution

  • You can use a CSS attribute selector.

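    For example, the [attribute|="value"] selector matches elements whose "data-testid" attribute is exactly "distance" or starts with "distance-" (for an exact match only, [data-testid="distance"] works as well). Note the single quotes around the selector and the double quotes around the value: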

    library(rvest)
    
    hoteles_verdes %>% 
      html_nodes('[data-testid|="distance"]') %>% 
      html_text()
    

    Result:

    [1] "the text i want"
    

    Data:

    hoteles_verdes <- read_html('<span class="class1">
                                 <span data-testid="distance">the text i want</span>
                                 </span>
                                 <span class="class2">
                                 <span class="class1"><span>the other text im obtaining</span>
                                 </span>')
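
As a complement to the answer above, here is a minimal sketch using the exact-match selector [data-testid="distance"] with html_elements(), the current name for html_nodes() since rvest 1.0.0. The sample page is copied from the question and is only assumed to resemble the real site; the variable distancia_scoped and the scoped selector are illustrative additions, not part of the original answer.

```r
library(rvest)

# Sample page copied from the question (assumed stand-in for the real site)
hoteles_verdes <- read_html('<span class="class1">
  <span data-testid="distance">the text i want</span>
  </span>
  <span class="class2">
  <span class="class1"><span>the other text im obtaining</span>
  </span>')

# Exact match on the attribute value
distancia <- hoteles_verdes %>%
  html_elements('[data-testid="distance"]') %>%
  html_text()

# Illustrative variant: only match the span when it sits directly
# inside a span with class "class1"
distancia_scoped <- hoteles_verdes %>%
  html_elements('span.class1 > span[data-testid="distance"]') %>%
  html_text()

print(distancia)
print(distancia_scoped)
```

Selecting on the data-testid attribute rather than on class1 avoids picking up the second span, which shares the class but lacks the attribute.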