Tags: html, r, web-scraping, rvest

How to extract the value of an apparently non-standard HTML tag in R


I have the following simplified HTML code (html_file.html).

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<div class="listing-wrapper__content">
<section class="card__amenities ">
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="floorSize"><span data-testid="l-icon" role="document" aria-label="Tamanho do imóvel" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span> 94 - 100 m² </p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="numberOfRooms"><span data-testid="l-icon" role="document" aria-label="Quantidade de quartos" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span> 3 </p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="numberOfBathroomsTotal"><span data-testid="l-icon" role="document" aria-label="Quantidade de banheiros" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span>3</p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity"><span data-testid="l-icon" role="document" aria-label="Quantidade de vagas de garagem" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span>2</p>
</section>
</div>
</body>
</html>

I managed to extract the first three elements. For example:

library(rvest)
pagee <- read_html("html_file.html")
nofrooms <- html_elements(pagee, ".listing-wrapper__content") %>%
  html_elements("[itemprop='numberOfRooms']") %>%
  html_text()
nofrooms

Output is

" 3 "

The problem is the last p tag. Unlike the others, it has no itemprop attribute that I could use as a selector. I have tried the following without success:

nofgarage <- html_elements(pagee, ".listing-wrapper__content") %>%
  html_elements("[aria-label='Quantidade de vagas de garagem']") %>%
  html_text()
nofgarage

Output is

""

The result is empty, as expected, because the value I want to extract is not inside the span tag but next to it, in the enclosing p tag.

Thanks for any help


Solution

  • Since most listings appear to have four amenities, one can use the xml_child() function from xml2 to select the fourth one (the garage count).
    A few listings are missing the fourth amenity, so we need to filter on the number of children before attempting to extract it.
    See the comments in the code below.

    library(rvest)
    library(xml2)
    library(dplyr)
    
    url <- "https://www.zapimoveis.com.br/venda/apartamentos/ms+campo-grande/?transacao=venda&onde=,Mato%20Grosso%20do%20Sul,Campo%20Grande,,,,,city,BR%3EMato%20Grosso%20do%20Sul%3ENULL%3ECampo%20Grande,-20.464852,-54.621848,&tipos=apartamento_residencial&pagina=1"
    
    #read page
    pagee <- read_html(url)
    
    #get the amenities section from each listing
    sections <- html_elements(pagee, "section.card__amenities ")
    #sections %>% html_elements("p") %>% html_text()
    
    #create a placeholder vector; listings without a 4th amenity keep the
    #default 0, which shows up as "0" once character values are assigned
    garages <- vector("numeric", length=length(sections))
    
    #retrieve the value of the 4th child node - not all apartments have
    #4 amenities, hence the need to filter first
    has_four <- xml_length(sections) == 4
    garages[has_four] <- sapply(sections[has_four], function(node) {
      xml_child(node, 4) %>% html_text()
    })
    
    #print the final vector
    garages
    # [1] "2" "4" "1" "1" "1" "1" "0" "2" "2" "2" "3" "1" "1" "1" "0"
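
As an alternative that does not depend on the garage count being the fourth child, here is a minimal sketch that locates the p tag through the aria-label of the span it contains. It is not part of the original answer. It assumes the aria-label text "Quantidade de vagas de garagem" is stable across listings, and the two-listing HTML below is a made-up sample modelled on the question's markup, not data from the site.

```r
library(rvest)

# Hypothetical sample: two listings modelled on the question's markup.
# The second listing has no garage amenity.
sample_html <- '
<div class="listing-wrapper__content">
  <section class="card__amenities">
    <p itemprop="numberOfRooms"><span aria-label="Quantidade de quartos"></span> 3 </p>
    <p><span aria-label="Quantidade de vagas de garagem"></span>2</p>
  </section>
</div>
<div class="listing-wrapper__content">
  <section class="card__amenities">
    <p itemprop="numberOfRooms"><span aria-label="Quantidade de quartos"></span> 1 </p>
  </section>
</div>'

pagee <- minimal_html(sample_html)

# One amenities section per listing
sections <- html_elements(pagee, "section.card__amenities")

# html_element() (singular) returns exactly one result per section and a
# "missing" node where nothing matches, so the output stays aligned with
# the listings. The XPath picks the <p> that contains the garage icon span.
garage_p <- html_element(
  sections,
  xpath = ".//p[span[@aria-label='Quantidade de vagas de garagem']]"
)

# html_text2() trims surrounding whitespace and yields NA for missing nodes
garages <- html_text2(garage_p)
garages
```

Listings without a garage entry come back as NA rather than "0", which keeps "no information" distinct from "zero garage spaces"; wrap the result in as.integer() if numbers are needed.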