I would like to automatically collect specific data from a real estate ad website. I have used the packages tidyverse and jsonlite as a starting point. Using these, I am able to collect what I am interested in to some extent.
Let's use this estates webpage: https://www.sreality.cz/hledani/prodej/byty
# Libraries -----------------------------------------------------------------------------------
library(jsonlite)
library(tidyverse)
# Web Page of Estates: https://www.sreality.cz/hledani/prodej/byty/praha
# Select one region
id = 10
A = paste0("https://www.sreality.cz/api/cs/v2/estates?category_main_cb=1&category_type_cb=1&locality_region_id=", id, "&page=")
B= paste("&per_page=40&tms=1583500044717")
C = paste0(A,1,B)
D = fromJSON(C)
# hash_id is already a column of the estates data frame, so no mutate() is needed;
# as_tibble() replaces the deprecated as_data_frame()
Res <-
  D$`_embedded`$estates %>%
  as_tibble()
Res %>% view()
This way the Res object contains basic information of interest such as price; using regular expressions we can also obtain the number of rooms, etc. However, some information I am interested in is missing, such as floor number (Podlaží), type of ownership (Vlastnictví) and others.
Let's look at one particular estate.
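As a minimal sketch of the regular-expression step mentioned above: assuming the listing title carries the layout in a form like "2+kk" or "3+1" (the sample titles and the helper name extract_first are hypothetical, not taken from the API), the layout and room count could be extracted with base R:

```r
# Hypothetical sample titles; real values would come from a column such as Res$name.
titles <- c("Prodej bytu 2+kk", "Prodej bytu 3+1", "Prodej bytu")

# Return the first match of `pattern` in each string, or NA when there is none.
extract_first <- function(x, pattern) {
  pos <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(pos) & pos > 0
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, pos)
  out
}

# Layout such as "2+kk" or "3+1"; the number before "+" is taken as the room count.
layout <- extract_first(titles, "\\d+\\+(?:kk|\\d+)")
rooms  <- as.integer(sub("\\+.*$", "", layout))

print(data.frame(title = titles, layout = layout, rooms = rooms))
```

Titles without a recognisable layout yield NA rather than an error, which keeps the result aligned with the rows of the listing table.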
Res$hash_id[1]
will return the estate's ID number. We can then google this offer using the ID, and we will find the following page:
Here we can see that information about the floor (Podlaží: 5. podlaží) is available. However, the D object contains no information about the floor (Podlaží) or about 'Vlastnictví'. I would like to scrape this information as well, for all the estates. Is there a way to do this in R?
You need additional API calls to get your desired data. The new URL for each hash_id is
https://www.sreality.cz/api/cs/v2/estates/<hash_id>?tms=<timestamp>
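To make the URL pattern concrete, here is a small sketch that fills in the two placeholders. The helper names detail_url and as_plain are hypothetical, the example IDs are made up, and the assumption that tms is the current Unix time in milliseconds is inferred from the value used in the question rather than from any API documentation.

```r
# Convert numbers to plain digit strings (no scientific notation);
# character input is passed through unchanged.
as_plain <- function(x) {
  if (is.numeric(x)) format(x, scientific = FALSE, trim = TRUE) else as.character(x)
}

# Build the detail URL for one or more listing IDs.
# Assumption: `tms` is the current time in milliseconds since the Unix epoch.
detail_url <- function(hash_id,
                       base = "https://www.sreality.cz",
                       tms = round(as.numeric(Sys.time()) * 1000)) {
  paste0(base, "/api/cs/v2/estates/", as_plain(hash_id), "?tms=", as_plain(tms))
}

# Made-up IDs and a fixed timestamp, so the result is reproducible.
urls <- detail_url(c(123456789, 987654321), tms = 1583500044717)
print(urls)
```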
Consider this workflow
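mas_url <- "https://www.sreality.cz"

# Return the detail-API URLs for all estates on one page of search results
get_links <- function(url, id, page) {
  # URL of the listing (search) endpoint for the given region and page
  tmp <- paste0(
    url,
    "/api/cs/v2/estates?category_main_cb=1&category_type_cb=1&locality_region_id=", id,
    "&page=", page,
    "&per_page=40&tms=1583500044717"
  )
  # hash_id of every estate on that page
  links <- jsonlite::fromJSON(tmp)$`_embedded`$estates$hash_id
  # current time in milliseconds, used as the tms parameter
  tms <- as.character(round(as.double(Sys.time()) * 1000))
  paste0(url, "/api/cs/v2/estates/", links, "?tms=", tms)
}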
# Only three listings are fetched here as a test
res <- lapply(get_links(mas_url, 10, 1)[10:12], jsonlite::fromJSON)
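If the lapply() call is extended to many listings, one failed request would abort the whole run. The following is a sketch of a more defensive loop, not part of the original workflow: fetch_all, its delay argument and the fake_fetch stand-in are hypothetical names, and the one-second pause is an arbitrary choice.

```r
# Fetch each URL in turn, pausing between requests and recording failures as NULL.
# `fetch` would normally be jsonlite::fromJSON; it is a parameter so the loop
# can be tried without network access.
fetch_all <- function(urls, fetch = jsonlite::fromJSON, delay = 1) {
  out <- vector("list", length(urls))
  names(out) <- urls
  for (i in seq_along(urls)) {
    result <- tryCatch(fetch(urls[[i]]), error = function(e) NULL)
    out[i] <- list(result)  # `out[[i]] <- NULL` would delete the element instead
    if (i < length(urls)) Sys.sleep(delay)
  }
  out
}

# Offline demonstration with a stand-in fetcher that fails for one URL.
fake_fetch <- function(u) if (grepl("bad", u)) stop("request failed") else list(url = u)
res_demo <- fetch_all(c("ok-1", "bad-2", "ok-3"), fetch = fake_fetch, delay = 0)

# Keep only the successful responses.
ok <- Filter(Negate(is.null), res_demo)
print(names(ok))
```

With real data the call would look like fetch_all(get_links(mas_url, 10, 1)), leaving fetch at its default.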
Each element in the list res corresponds to one estate. The information you need can be found in, for example,
res[[1L]]$items
Output
   negotiation                   name   notes                 value currency      type          unit topped
1        FALSE           Celková cena bez DPH             8 950 400       Kc price_czk za nemovitost     NA
2           NA             ID zakázky    NULL                  D401     <NA>    string          <NA>     NA
3           NA            Aktualizace    NULL                  Dnes     <NA>    edited          <NA>   TRUE
4           NA                 Stavba    NULL               Cihlová     <NA>    string          <NA>     NA
5           NA           Stav objektu    NULL            Novostavba     <NA>    string          <NA>     NA
6           NA            Vlastnictví    NULL                Osobní     <NA>    string          <NA>     NA
7           NA       Umístení objektu    NULL          Centrum obce     <NA>    string          <NA>     NA
8           NA                Podlaží    NULL 4. podlaží z celkem 5     <NA>    string          <NA>     NA
9           NA          Užitná plocha    NULL                    77     <NA>      area            m2     NA
10          NA       Plocha podlahová    NULL                    83     <NA>      area            m2     NA
11          NA                 Terasa    NULL                  TRUE     <NA>   boolean          <NA>     NA
12          NA                  Sklep    NULL                  TRUE     <NA>   boolean          <NA>     NA
13          NA                  Garáž    NULL                  TRUE     <NA>   boolean          <NA>     NA
14          NA      Datum nastehování    NULL            30.01.2023     <NA>      date          <NA>     NA
15          NA Datum zahájení prodeje    NULL            01.08.2020     <NA>      date          <NA>     NA
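To go from this items table to the two fields asked about, one could look up rows by name. The sketch below uses a small mock of the items data frame (values copied from the output above) so that it runs offline; get_item, parse_floor and summarise_estate are hypothetical helper names, and treating value as a list column is an assumption about how jsonlite parses the response.

```r
# Mock of res[[1]]$items, reduced to the two columns used here.
items <- data.frame(
  name = c("Vlastnictví", "Podlaží", "Užitná plocha"),
  stringsAsFactors = FALSE
)
items$value <- list("Osobní", "4. podlaží z celkem 5", "77")

# Look up one field by its name; NA when the listing does not have it.
get_item <- function(items, field) {
  i <- match(field, items$name)
  if (is.na(i)) return(NA_character_)
  paste(unlist(items$value[i]), collapse = ", ")
}

# Leading integer of a text such as "4. podlaží z celkem 5"; NA when absent.
parse_floor <- function(x) {
  pos <- regexpr("^-?\\d+", x)
  hit <- !is.na(pos) & pos > 0
  out <- rep(NA_integer_, length(x))
  out[hit] <- as.integer(regmatches(x, pos))
  out
}

# One row per estate with the fields of interest.
summarise_estate <- function(items) {
  data.frame(
    ownership = get_item(items, "Vlastnictví"),
    floor_txt = get_item(items, "Podlaží"),
    stringsAsFactors = FALSE
  )
}

estates <- do.call(rbind, lapply(list(items), summarise_estate))
estates$floor <- parse_floor(estates$floor_txt)
print(estates)
```

With the real responses, the lapply() call would presumably become lapply(res, function(x) summarise_estate(x$items)), giving one row per scraped estate.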