
Web scraping of real estate listings [r]


I would like to automatically collect specific data from a real estate ad website. I have used the tidyverse and jsonlite packages as a starting point. With these I am able to collect what I am interested in to some extent.

Let's use this estates webpage: https://www.sreality.cz/hledani/prodej/byty

# Libraries -----------------------------------------------------------------------------------
library(jsonlite)
library(tidyverse)

# Web Page of Estates: https://www.sreality.cz/hledani/prodej/byty/praha

# Select one region
id <- 10

# Build the API URL for page 1 of the selected region
A <- paste0("https://www.sreality.cz/api/cs/v2/estates?category_main_cb=1&category_type_cb=1&locality_region_id=", id, "&page=")
B <- "&per_page=40&tms=1583500044717"
C <- paste0(A, 1, B)
D <- fromJSON(C)


# hash_id is already a column of the estates table, so no mutate() is needed;
# as_tibble() replaces the deprecated as_data_frame()
Res <- 
  D$`_embedded`$estates %>% 
  as_tibble()

Res %>% view()

This way the Res object contains basic information of interest such as price; using regular expressions we can also obtain the number of rooms etc. However, some information I am interested in is missing, such as the floor number (Podlaží), the type of ownership (Vlastnictví) and others.
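
As an illustration of the regular-expression step, here is a minimal sketch that pulls the room layout out of a listing name. It assumes the names look like "Prodej bytu 2+kk 54 m²"; the sample strings and the helper `extract_layout` are made up for this example and are not taken from the API response.

```r
# Assumed sample values; the real listing names may be formatted differently.
listing_names <- c(
  "Prodej bytu 2+kk 54 m²",
  "Prodej bytu 3+1 78 m²",
  "Prodej bytu atypické 40 m²"
)

# Hypothetical helper: return layouts such as "2+kk" or "3+1",
# and NA where no layout is found in the name.
extract_layout <- function(x) {
  hit <- regexpr("[0-9]+\\+(kk|[0-9]+)", x)
  out <- rep(NA_character_, length(x))
  out[hit > 0] <- regmatches(x, hit)  # regmatches() returns matched entries only
  out
}

layouts <- extract_layout(listing_names)
```

Pre-filling the result with NA keeps it the same length as the input, so it can be added as a column next to the other fields.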

Let's look at one particular estate:

Res$hash_id[1]

will return the estate's ID number. Searching online for the offer by this ID

leads to the following page:

https://www.sreality.cz/detail/prodej/byt/2+kk/praha-praha-4-milevska/822296156#img=0&fullscreen=false

Here we can see that information about the floor (Podlaží: 5. podlaží) is available. However, the D object contains no information about the floor (Podlaží) or about the ownership ('Vlastnictví'). I would like to be able to scrape this information for all the estates as well. Is there a way to do this in R?


Solution

  • You need additional API calls to get your desired data. The new URL for each hash_id is

    https://www.sreality.cz/api/cs/v2/estates/<hash_id>?tms=<timestamp>
    

    Consider this workflow

    mas_url <- "https://www.sreality.cz"
    
    get_links <- function(url, id, page) {
      tmp <- paste0(
        url, 
        "/api/cs/v2/estates?category_main_cb=1&category_type_cb=1&locality_region_id=", id, 
        "&page=", page, 
        "&per_page=40&tms=1583500044717"
      )
      links <- jsonlite::fromJSON(tmp)$`_embedded`$estates$hash_id
      tms <- as.character(round(as.double(Sys.time())*1000))
      paste0(url, "/api/cs/v2/estates/", links, "?tms=", tms)
    }
    
    # Only three estates are fetched here as a test
    res <- lapply(get_links(mas_url, 10, 1)[10:12], jsonlite::fromJSON)
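
    When fetching more than a handful of estates, it may be worth pausing between requests and not letting one failed call abort the whole run. The following is only a sketch: `fetch_all`, its `fetch` argument and the one-second default delay are assumptions made for illustration, not part of the workflow above.

    ```r
    # Hypothetical helper: fetch each URL in turn, pausing between requests.
    # `fetch` defaults to jsonlite::fromJSON but can be swapped out for testing.
    fetch_all <- function(urls, fetch = jsonlite::fromJSON, delay = 1) {
      out <- vector("list", length(urls))
      for (i in seq_along(urls)) {
        # A failed request leaves a NULL entry instead of stopping the loop
        result <- tryCatch(fetch(urls[i]), error = function(e) NULL)
        if (!is.null(result)) out[[i]] <- result
        if (i < length(urls)) Sys.sleep(delay)
      }
      out
    }

    # Offline demonstration with a stand-in fetcher (no network access needed)
    demo <- fetch_all(c("a", "b"), fetch = toupper, delay = 0)
    ```

    With the helper above, the call would look like `res <- fetch_all(get_links(mas_url, 10, 1))`.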
    

    Each element in the list res corresponds to one estate. The information you need can be found at, for example,

    res[[1L]]$items
    

    Output

       negotiation                   name   notes                 value currency      type          unit topped
    1        FALSE           Celková cena bez DPH             8 950 400       Kc price_czk za nemovitost     NA
    2           NA             ID zakázky    NULL                  D401     <NA>    string          <NA>     NA
    3           NA            Aktualizace    NULL                  Dnes     <NA>    edited          <NA>   TRUE
    4           NA                 Stavba    NULL               Cihlová     <NA>    string          <NA>     NA
    5           NA           Stav objektu    NULL            Novostavba     <NA>    string          <NA>     NA
    6           NA            Vlastnictví    NULL                Osobní     <NA>    string          <NA>     NA
    7           NA       Umístení objektu    NULL          Centrum obce     <NA>    string          <NA>     NA
    8           NA                Podlaží    NULL 4. podlaží z celkem 5     <NA>    string          <NA>     NA
    9           NA          Užitná plocha    NULL                    77     <NA>      area            m2     NA
    10          NA       Plocha podlahová    NULL                    83     <NA>      area            m2     NA
    11          NA                 Terasa    NULL                  TRUE     <NA>   boolean          <NA>     NA
    12          NA                  Sklep    NULL                  TRUE     <NA>   boolean          <NA>     NA
    13          NA                  Garáž    NULL                  TRUE     <NA>   boolean          <NA>     NA
    14          NA      Datum nastehování    NULL            30.01.2023     <NA>      date          <NA>     NA
    15          NA Datum zahájení prodeje    NULL            01.08.2020     <NA>      date          <NA>     NA
    16          NA                  Výtah    NULL                  TRUE     <NA>   boolean          <NA>     NA
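
    To get from this table to the two fields asked about, the rows can be looked up by `name`. The sketch below uses a small hand-made `items` table in place of a real API response. The helper `extract_fields` is hypothetical, and it assumes `value` may be either a plain vector or a list column.

    ```r
    # Hypothetical helper: pull selected fields out of one estate's `items` table.
    extract_fields <- function(items, fields = c("Podlaží", "Vlastnictví")) {
      idx <- match(fields, items$name)  # NA when a field is absent
      values <- vapply(idx, function(i) {
        if (is.na(i)) return(NA_character_)
        # `value` may be a vector or a list column, so flatten defensively
        paste(unlist(items$value[i]), collapse = ", ")
      }, character(1))
      stats::setNames(values, fields)
    }

    # Hand-made stand-in for res[[1L]]$items (assumed shape, not real data)
    items <- data.frame(
      name = c("Vlastnictví", "Podlaží", "Výtah"),
      stringsAsFactors = FALSE
    )
    items$value <- list("Osobní", "4. podlaží z celkem 5", TRUE)

    fields_found <- extract_fields(items)
    ```

    Applied to the whole list, something like `do.call(rbind, lapply(res, function(x) extract_fields(x$items)))` should give one row per estate, with NA where a listing lacks a field.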