I have the following simplified HTML (html_file.html):
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<div class="listing-wrapper__content">
<section class="card__amenities ">
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="floorSize"><span data-testid="l-icon" role="document" aria-label="Tamanho do imóvel" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span> 94 - 100 m² </p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="numberOfRooms"><span data-testid="l-icon" role="document" aria-label="Quantidade de quartos" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span> 3 </p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="numberOfBathroomsTotal"><span data-testid="l-icon" role="document" aria-label="Quantidade de banheiros" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span>3</p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity"><span data-testid="l-icon" role="document" aria-label="Quantidade de vagas de garagem" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span>2</p>
</section>
</div>
</body>
</html>
I managed to extract the first three elements. For example:
library(rvest)
pagee <- read_html("html_file.html")
nofrooms <- html_elements(pagee, ".listing-wrapper__content")%>%html_nodes("[itemprop='numberOfRooms']")%>%html_text()
nofrooms
Output is
" 3 "
The problem is the last p tag. Unlike the other three, it has no itemprop attribute, so I see no obvious selector for extracting its value. I have tried the following without success:
nofgarage <- html_elements(pagee, ".listing-wrapper__content")%>%html_nodes("[aria-label='Quantidade de vagas de garagem']")%>%html_text()
nofgarage
Output is
""
The result is empty as expected, as the value I want to extract is not between the span tags.
Thanks for any help
Since most listings appear to have 4 amenities, one could use the xml_child()
function from xml2 to select the 4th one.
In this case a few listings are missing the 4th amenity, so we need to filter before attempting to extract it.
See the comments in the code below.
library(rvest)
library(xml2)
library(dplyr)
url <- "https://www.zapimoveis.com.br/venda/apartamentos/ms+campo-grande/?transacao=venda&onde=,Mato%20Grosso%20do%20Sul,Campo%20Grande,,,,,city,BR%3EMato%20Grosso%20do%20Sul%3ENULL%3ECampo%20Grande,-20.464852,-54.621848,&tipos=apartamento_residencial&pagina=1"
#read page
pagee <- read_html(url)
#get the amenities section from each listing
sections <- html_elements(pagee, "section.card__amenities ")
#sections %>% html_elements("p") %>% html_text()
#create a vector of zeros, one entry per listing
garages <- vector("numeric", length=length(sections))
#retrieve the text of the 4th child node - not all listings have 4 amenities, hence the filter
#(listings without a 4th amenity keep 0; the character assignment coerces the whole vector to character)
garages[xml_length(sections)==4] <- sapply(sections[xml_length(sections)==4], function(node)
{xml_child(node, 4) %>% html_text()})
#print the final vector
garages
# [1] "2" "4" "1" "1" "1" "1" "0" "2" "2" "2" "3" "1" "1" "1" "0"
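A possible alternative, sketched here as an assumption rather than something tested against the live site: instead of relying on the garage count being the 4th child, select the p element whose child span carries the garage aria-label. The inline HTML below is made up for illustration (two listings, the second one without a garage entry), and the names page, garage_p and garages are only illustrative. Applied to a node set, html_element() (singular) returns one result per section, so listings without a match should come back as NA rather than being dropped.

```r
library(rvest)

# Made-up sample: the first listing has a garage entry, the second does not
page <- minimal_html('
  <section class="card__amenities">
    <p itemprop="numberOfRooms"><span aria-label="Quantidade de quartos"></span> 3 </p>
    <p><span aria-label="Quantidade de vagas de garagem"></span>2</p>
  </section>
  <section class="card__amenities">
    <p itemprop="numberOfRooms"><span aria-label="Quantidade de quartos"></span> 2 </p>
  </section>
')

# One node per listing
sections <- html_elements(page, "section.card__amenities")

# Within each section, pick the <p> whose child <span> has the garage label;
# sections with no such <p> yield a "missing" node
garage_p <- html_element(
  sections,
  xpath = ".//p[span[@aria-label='Quantidade de vagas de garagem']]"
)

# Text of the <p> itself (the span is empty), trimmed; missing nodes give NA
garages <- html_text(garage_p, trim = TRUE)
garages
```

The text lives in the p element, not in the span, which is why selecting the span alone returned an empty string. Whether NA or 0 is the better value for a listing without a garage entry depends on how the result is used afterwards.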