Tags: r, web-scraping, rvest

Can't get rvest to grab the data from a webpage


Something that should take a few minutes has been taking me forever. I am trying to scrape property listing information from the following URL:

https://www.immobiliare.it/search-list/?idContratto=1&idCategoria=1&idTipologia%5B0%5D=7&idTipologia%5B1%5D=12&idTipologia%5B2%5D=13&idTipologia%5B3%5D=11&criterio=rilevanza&__lang=it&fkRegione=ven&idProvincia=VE&idNazione=IT&pag=1&dtCookie=v_4_srv_3_sn_9E341BF8AC6892004B9D2502432FB6E5_perc_100000_ol_0_mul_1_app-3Aea7c4b59f27d43eb_0

My goal is to use rvest to extract the listed price, the address/location, and all the characteristics of each unit, looping over the available pages.

For now, I just want to check that the scraping works on the first page before looping over the rest of the pages.

library(rvest)
library(dplyr)
library(httr)

url <- "https://www.immobiliare.it/search-list/?idContratto=1&idCategoria=1&idTipologia%5B0%5D=7&idTipologia%5B1%5D=12&idTipologia%5B2%5D=13&idTipologia%5B3%5D=11&criterio=rilevanza&__lang=it&fkRegione=ven&idProvincia=VE&idNazione=IT&pag=1&dtCookie=v_4_srv_3_sn_9E341BF8AC6892004B9D2502432FB6E5_perc_100000_ol_0_mul_1_app-3Aea7c4b59f27d43eb_0"

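# Fetch the page and keep the raw HTML returned by the server.
response <- GET(url)
raw_html <- content(response, as = "text", encoding = "UTF-8")
parsed_html <- read_html(raw_html)

# Look for the price element (html_element() returns the first match only).
property_price <- parsed_html %>% html_element("div.in-listingCardPrice")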

However, the last line comes back empty. It seems to me that the CSS selector passed to html_element() is incorrect. I used the SelectorGadget browser add-on to get it.

Does anyone know where my code fails to grab the correct data? I am sure there's a more elegant way to do this but I am new to R and that's the best I've come up with. Thanks so much for all your help!


Solution

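  • This is a dynamic site: the content you are after is rendered in the browser by JavaScript rather than on the server, so the HTML returned by GET() does not contain the listings and your selector has nothing to match. The read_html_live() function (available in rvest 1.0.4 and later, and relying on the chromote package and a Chrome installation) loads the page in a real browser and lets you query the "live" page, which should give you what you want.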

    library(rvest)
    library(dplyr)
    
    url <- "https://www.immobiliare.it/search-list/?idContratto=1&idCategoria=1&idTipologia%5B0%5D=7&idTipologia%5B1%5D=12&idTipologia%5B2%5D=13&idTipologia%5B3%5D=11&criterio=rilevanza&__lang=it&fkRegione=ven&idProvincia=VE&idNazione=IT&pag=1&dtCookie=v_4_srv_3_sn_9E341BF8AC6892004B9D2502432FB6E5_perc_100000_ol_0_mul_1_app-3Aea7c4b59f27d43eb_0"
    
    # Create a live session.
    session <- read_html_live(url)
    
    # Pop up browser window showing live site.
    session$view()
    
    # Extract the price of the first listing.
    session %>% html_element("div.in-listingCardPrice")
    

    Output:

    {html_node}
    <div class="in-listingCardPrice">
    [1] <span>€ 247.000</span>
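
To go from the first listing to every listing, and from one page to several, something along the following lines might work. It is only a sketch under a few assumptions: that the `pag=N` query parameter in the URL above controls pagination, that the `div.in-listingCardPrice` selector stays valid, and that a short fixed pause is enough for the listings to render. The helper names `page_url()`, `extract_prices()`, and `scrape_pages()` are made up for illustration, and the demo HTML is hand-written, not real listing data.

```r
library(rvest)

# Hypothetical helper: build the URL of results page `page` by rewriting
# the `pag` query parameter (assumes the site paginates through `pag=N`).
page_url <- function(base_url, page) {
  sub("pag=[0-9]+", paste0("pag=", page), base_url)
}

# Hypothetical helper: return the text of every price card on a parsed page.
# html_elements() (plural) returns all matches, unlike html_element().
extract_prices <- function(html) {
  html %>%
    html_elements("div.in-listingCardPrice") %>%
    html_text2()
}

# Offline check on a hand-written snippet that mimics the page structure.
demo <- minimal_html('
  <div class="in-listingCardPrice"><span>€ 100.000</span></div>
  <div class="in-listingCardPrice"><span>€ 250.000</span></div>
')
demo_prices <- extract_prices(demo)

# Live version: one browser-rendered page per iteration.
# Not called here because it needs network access and Chrome.
scrape_pages <- function(base_url, pages) {
  lapply(pages, function(p) {
    session <- read_html_live(page_url(base_url, p))
    Sys.sleep(2)  # crude wait for rendering; an assumption, tune as needed
    extract_prices(session)
  })
}
```

Calling `scrape_pages(url, 1:3)` would then return a list with one character vector of prices per page. Keeping the selector logic in `extract_prices()` means it can be tested on static HTML before pointing it at the live site, and the same pattern can be repeated for the address and the other characteristics once their selectors are known.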