Web scraping real estate listings [r]

I would like to automatically collect specific data from a real estate ad website. I have used the packages tidyverse and jsonlite as a starting point. Using these, I am able to collect what I am interested in to some extent.

Let's use this estates webpage: https://www.sreality.cz/hledani/prodej/byty

# Libraries -----------------------------------------------------------------------------------
library(jsonlite)
library(tidyverse)

# Web Page of Estates: https://www.sreality.cz/hledani/prodej/byty/praha

# Select one region
id <- 10

# Build the URL of the first results page for the selected region
A <- paste0("https://www.sreality.cz/api/cs/v2/estates?category_main_cb=1&category_type_cb=1&locality_region_id=", id, "&page=")
B <- "&per_page=40&tms=1583500044717"
C <- paste0(A, 1, B)
D <- fromJSON(C)


# hash_id is already a column of the estates table, so no mutate() is needed;
# as_tibble() replaces the deprecated as_data_frame()
Res <- 
  D$`_embedded`$estates %>% 
  as_tibble()

Res %>% view()

This way the Res object contains basic information of interest such as price, and with regular expressions we can also obtain the number of rooms etc. However, some information I am interested in is missing, such as the floor number (Podlaží), type of ownership (Vlastnictví) and others.
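To illustrate the regular expression step, here is a minimal sketch. It assumes the listing title contains the layout in a form such as "2+kk" or "3+1"; the sample titles and the helper name extract_layout are made up for the example and are not taken from the API response.

```r
# Hypothetical helper: pull a layout such as "2+kk" or "3+1" out of a title.
extract_layout <- function(x) {
  m <- regexpr("[0-9]+\\+(kk|[0-9]+)", x)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)  # regmatches() returns only the matched elements
  out
}

# Made-up sample titles (assumed format, for illustration only)
titles <- c("Prodej bytu 2+kk 54 m2", "Prodej bytu 3+1 78 m2", "Prodej bytu 120 m2")
layouts <- extract_layout(titles)
print(layouts)
```

Titles without a recognisable layout come back as NA, so the result always has the same length as the input and can be added to Res as a new column.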

Let's look at one particular estate:

Res$hash_id[1]

This returns the estate's ID number. Searching for the offer by this ID leads to the following page:

https://www.sreality.cz/detail/prodej/byt/2+kk/praha-praha-4-milevska/822296156#img=0&fullscreen=false

Here we can see that the floor information (Podlaží: 5. podlaží) is available. However, the D object contains neither the floor (Podlaží) nor the ownership type (Vlastnictví). I would like to scrape this information for all the estates as well. Is there a way to do this in R?

Solution

You need additional API calls to get your desired data. The detail URL for each hash_id has the form

https://www.sreality.cz/api/cs/v2/estates/<hash_id>?tms=<timestamp>

Consider this workflow

mas_url <- "https://www.sreality.cz"

get_links <- function(url, id, page) {
  tmp <- paste0(
    url, 
    "/api/cs/v2/estates?category_main_cb=1&category_type_cb=1&locality_region_id=", id, 
    "&page=", page, 
    "&per_page=40&tms=1583500044717"
  )
  links <- jsonlite::fromJSON(tmp)$`_embedded`$estates$hash_id
  tms <- as.character(round(as.double(Sys.time())*1000))
  paste0(url, "/api/cs/v2/estates/", links, "?tms=", tms)
}

# Only three listings are fetched here as a test
res <- lapply(get_links(mas_url, 10, 1)[10:12], jsonlite::fromJSON)
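When fetching more than a few listings, it may help to pause between requests and to survive individual failures. The following is only a sketch: fetch_estates, its delay argument and the fake fetcher used in the demo are hypothetical names introduced here, and the one-second default pause is an assumption, not a documented limit of the site.

```r
# Hypothetical helper: fetch each URL in turn, pausing between requests.
# `fetch` defaults to jsonlite::fromJSON but can be swapped for testing.
fetch_estates <- function(urls, fetch = jsonlite::fromJSON, delay = 1) {
  lapply(urls, function(u) {
    Sys.sleep(delay)  # assumed polite pause, in seconds
    # A failed request yields NULL instead of aborting the whole loop
    tryCatch(fetch(u), error = function(e) NULL)
  })
}

# Offline demo with a made-up fetcher that fails for one URL
fake_fetch <- function(u) {
  if (grepl("bad", u)) stop("request failed")
  list(url = u)
}
out <- fetch_estates(c("ok-1", "bad-2", "ok-3"), fetch = fake_fetch, delay = 0)
ok <- !vapply(out, is.null, logical(1))
print(ok)
```

With real data the call would be fetch_estates(get_links(mas_url, 10, 1)), after which the NULL entries can be dropped or retried.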

Each element in the list res corresponds to one estate. The information you need can be found at, for example,

res[[1L]]$items

Output

   negotiation                   name   notes                 value currency      type          unit topped
1        FALSE           Celková cena bez DPH             8 950 400       Kc price_czk za nemovitost     NA
2           NA             ID zakázky    NULL                  D401     <NA>    string          <NA>     NA
3           NA            Aktualizace    NULL                  Dnes     <NA>    edited          <NA>   TRUE
4           NA                 Stavba    NULL               Cihlová     <NA>    string          <NA>     NA
5           NA           Stav objektu    NULL            Novostavba     <NA>    string          <NA>     NA
6           NA            Vlastnictví    NULL                Osobní     <NA>    string          <NA>     NA
7           NA       Umístení objektu    NULL          Centrum obce     <NA>    string          <NA>     NA
8           NA                Podlaží    NULL 4. podlaží z celkem 5     <NA>    string          <NA>     NA
9           NA          Užitná plocha    NULL                    77     <NA>      area            m2     NA
10          NA       Plocha podlahová    NULL                    83     <NA>      area            m2     NA
11          NA                 Terasa    NULL                  TRUE     <NA>   boolean          <NA>     NA
12          NA                  Sklep    NULL                  TRUE     <NA>   boolean          <NA>     NA
13          NA                  Garáž    NULL                  TRUE     <NA>   boolean          <NA>     NA
14          NA      Datum nastehování    NULL            30.01.2023     <NA>      date          <NA>     NA
15          NA Datum zahájení prodeje    NULL            01.08.2020     <NA>      date          <NA>     NA
16          NA                  Výtah    NULL                  TRUE     <NA>   boolean          <NA>     NA
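To turn these per-estate tables into one overview, the rows of interest can be looked up by their label. This is a minimal sketch under assumptions: get_item, parse_floor and summarise_estate are hypothetical helpers, the value column is assumed to be either an atomic vector or a list column, and the mock data below only mimics the shape of the output above.

```r
# Hypothetical helper: value of the row whose `name` equals `label`, as text.
get_item <- function(items, label) {
  idx <- match(label, items$name)
  if (is.na(idx)) return(NA_character_)  # label missing for this estate
  # `[[` works for both an atomic vector and a list column
  paste(unlist(items$value[[idx]]), collapse = ", ")
}

# Hypothetical helper: leading integer of e.g. "4. podlaží z celkem 5".
parse_floor <- function(x) {
  m <- regexpr("^-?[0-9]+", x)
  if (is.na(m) || m < 0) return(NA_integer_)
  as.integer(regmatches(x, m))
}

# One row per estate with the fields asked about in the question
summarise_estate <- function(estate) {
  floor_txt <- get_item(estate$items, "Podlaží")
  data.frame(
    floor_text = floor_txt,
    floor      = parse_floor(floor_txt),
    ownership  = get_item(estate$items, "Vlastnictví"),
    stringsAsFactors = FALSE
  )
}

# Made-up stand-in for `res`, shaped like the output above
mock_items <- data.frame(name = c("Vlastnictví", "Podlaží", "Výtah"),
                         stringsAsFactors = FALSE)
mock_items$value <- list("Osobní", "4. podlaží z celkem 5", TRUE)
mock_res <- list(list(items = mock_items))

overview <- do.call(rbind, lapply(mock_res, summarise_estate))
print(overview)
```

With the real data, do.call(rbind, lapply(res, summarise_estate)) would give one row per estate; labels that an estate does not have come back as NA.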