I want to scrape one table from the following website: https://www.katastar.hr
To see what I mean, open the browser's developer tools (Inspect) and then click the Network tab. When you load the site, you can see a request to this URL: https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined
The problem is that id and status are different every time you open the site. How can I scrape the output of the above request (a JSON response that represents a table) when the GET query parameters change every time?
I would give a reproducible example, but there is not much I can try. I assume I should start from the home page, but I don't know how to proceed:
headers <- c(
"Accept" = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding' = "gzip, deflate, br",
'Accept-Language' = 'hr-HR,hr;q=0.9,en-US;q=0.8,en;q=0.7',
"Cache-Control" = "max-age=0",
"Connection" = "keep-alive",
"DNT" = "1",
"Host" = "www.katastar.hr",
"If-Modified-Since" = "Mon, 22 Mar 2021 13:39:38 GMT",
"Referer" = "https://www.google.com/",
"sec-ch-ua" = '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
"sec-ch-ua-mobile" = "?0",
"Sec-Fetch-Dest" = "document",
"Sec-Fetch-Mode" = "navigate",
"Sec-Fetch-Site" = "same-origin",
"Sec-Fetch-User" = "?1",
"Upgrade-Insecure-Requests" = "1",
"User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
)
p <- httr::GET(
  "https://www.katastar.hr/",
  httr::add_headers(.headers = headers))
httr::cookies(p)
The solution can be in either R or Python.
You just need the HTTP header Origin to make it work:
import requests

r = requests.get(
    "https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined",
    headers={
        "Origin": "https://www.katastar.hr"
    })
print(r.json())
repl.it: https://replit.com/@bertrandmartel/ScrapeKatastar
library(httr)
data <- content(GET(
  "https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined",
  add_headers(origin = "https://www.katastar.hr")
), as = "parsed", type = "application/json")
print(data)
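If installing requests is not an option, the same call can be sketched with only the Python standard library. The helper names build_request and fetch_json are hypothetical, and the sketch assumes the endpoint keeps answering with JSON whenever the Origin header is present, as it does in the examples above.

```python
import json
import urllib.request

ORIGIN = "https://www.katastar.hr"


def build_request(url):
    """Create a GET request that carries the Origin header used above."""
    return urllib.request.Request(url, headers={"Origin": ORIGIN})


def fetch_json(url):
    """Send the request and decode the JSON body (performs a network call)."""
    with urllib.request.urlopen(build_request(url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_json with the URL from the question should then give the same data as the requests version.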
To go a little further into how the website generates id and status, there is the following code in JS:
e.prototype.getSurveyors = function(e) {
    var t = this.runbase(),
        n = this.create(t.toString(), null);
    return this.httpClient.get(s + "/position", {
        params: {
            id: t.toString(),
            status: n,
            x: String(e[0]),
            y: String(e[1])
        }
    })
}
e.prototype.runbase = function() {
    return Math.floor(1e7 * Math.random())
}
e.prototype.create = function(e, t) {
    for (var n = 0, i = 0; i < e.length; i++) n = (n << 5) - n + e.charAt(i).charCodeAt(0), n &= n;
    return null == t && (t = e), Math.abs(n).toString().substring(0, 6) + (Number(t) << 1)
}
It takes a random number id, encodes it using a specific algorithm, and puts the result into the status field. The server then checks whether the encoded status value matches the id value.
It seems previous id values still work, as in the sample above (in case there is no data sent), but you can also reproduce the JS function above like this (example in Python):
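from random import randint
import ctypes
import requests

# Random 7-digit id, playing the role of runbase() in the JS code
number = randint(1000000, 9999999)

def encode(rand, data):
    randStr = str(rand)
    n = 0
    for char in randStr:
        # ctypes.c_int emulates the 32-bit wraparound of JS bitwise operators
        n = ctypes.c_int(n << 5).value - n + ord(char)
        n = ctypes.c_int(n & n).value
    if data is None:
        # No payload: the suffix is derived from the id itself
        suffix = ctypes.c_int(rand << 1).value
    else:
        suffix = ctypes.c_int(data << 1).value
    # First six digits of the absolute hash, followed by the shifted value
    return f"{str(abs(n))[:6]}{suffix}"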
r = requests.get("https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position",
    params={
        "id": number,
        "status": encode(number, None)
    },
    headers={
        "Origin": "https://www.katastar.hr"
    })
print(r.json())

# GET parcel Id 13241901
parcelId = 13241901
r = requests.get("https://oss.uredjenazemlja.hr/rest/katHr/parcelInfo",
    params={
        "id": number,
        "status": encode(number, parcelId)
    },
    headers={
        "Origin": "https://www.katastar.hr"
    })
print(r.json())
repl.it: https://replit.com/@bertrandmartel/ScrapeKatastarDecode
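As an offline sanity check of the port, the id/status pair quoted in the question can be reproduced without any network call. The sketch below is a variant that replaces ctypes with plain bit masking; the names to_int32 and make_status are hypothetical and not part of the site's code.

```python
def to_int32(value):
    """Wrap an integer to signed 32 bits, like JavaScript's bitwise operators."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def make_status(rand_id, data=None):
    """Port of the JS create(): hash the id digits, then append (data << 1)."""
    n = 0
    for char in str(rand_id):
        # Same step as n = (n << 5) - n + charCode; n &= n in the JS code
        n = to_int32(to_int32(n << 5) - n + ord(char))
    # With no payload, the JS code falls back to the id itself
    payload = rand_id if data is None else data
    return f"{str(abs(n))[:6]}{to_int32(payload << 1)}"


# id taken from the URL in the question
print(make_status(2432593))  # prints 1332094865186
```

The printed value equals the status parameter in the question's URL: the first six digits (133209) come from the hash of the id digits, and the rest (4865186) is the id shifted left by one bit.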