Tags: python, r, web-scraping, python-requests, httr

Web scrape table from site


I want to scrape one table from the following website: https://www.katastar.hr

To see what I mean, open the browser's developer tools (Inspect) and then click the Network tab. When you open the site, you can see a request to this URL: https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined

The problem is that id and status are different every time you open the site. How can I scrape the output of the above request (a JSON response that represents a table) when the GET query parameters change every time?

I would give a reproducible example, but I don't have much to show. I assume I should start from the home page, but I don't know how to proceed from there:

headers <- c(
  "Accept" = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding' = "gzip, deflate, br",
  'Accept-Language' = 'hr-HR,hr;q=0.9,en-US;q=0.8,en;q=0.7',
  "Cache-Control" = "max-age=0",
  "Connection" = "keep-alive",
  "DNT" = "1",
  "Host" = "www.katastar.hr",
  "If-Modified-Since" = "Mon, 22 Mar 2021 13:39:38 GMT",
  "Referer" = "https://www.google.com/",
  "sec-ch-ua" = '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
  "sec-ch-ua-mobile" = "?0",
  "Sec-Fetch-Dest" = "document",
  "Sec-Fetch-Mode" = "navigate",
  "Sec-Fetch-Site" = "same-origin",
  "Sec-Fetch-User" = "?1",
  "Upgrade-Insecure-Requests" = "1",
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
)
p <- httr::GET(
  "https://www.katastar.hr/",
  httr::add_headers(.headers = headers))
httr::cookies(p)

The code can be in either R or Python.


Solution

  • You just need to send the HTTP Origin header to make it work:

    • python
    import requests
    
    r = requests.get(
        "https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined",
        headers={
            "Origin": "https://www.katastar.hr"
        })
    
    print(r.json())
    

    repl.it: https://replit.com/@bertrandmartel/ScrapeKatastar

    • R
    library(httr)
    
    data <- content(GET(
      "https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined",
      add_headers(origin = "https://www.katastar.hr")
      ), as = "parsed", type = "application/json")
    
    print(data)
    

    To go a little further into how the website generates id and status, here is the relevant JS code:

    e.prototype.getSurveyors = function(e) {
        var t = this.runbase(),
          n = this.create(t.toString(), null);
        return this.httpClient.get(s + "/position", {
          params: {
            id: t.toString(),
            status: n,
            x: String(e[0]),
            y: String(e[1])
          }
        })
    }
    e.prototype.runbase = function() {
        return Math.floor(1e7 * Math.random())
    }
    e.prototype.create = function(e, t) {
        for (var n = 0, i = 0; i < e.length; i++) n = (n << 5) - n + e.charAt(i).charCodeAt(0), n &= n;
        return null == t && (t = e), Math.abs(n).toString().substring(0, 6) + (Number(t) << 1)
    }
    

    It takes a random number as the id, encodes it with a specific algorithm, and puts the result into the status field. The server then checks whether the encoded status value matches the id value.

    Previously generated id values still seem to work, as in the sample above (when no data is sent), but you can also reproduce the JS function above like this (example in Python):

    from random import randint
    import ctypes
    import requests
    
    number = randint(1000000, 9999999)
    
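    def encode(rand, data):
        """Port of the JS create(): hash of the id digits plus a shifted suffix."""
        randStr = str(rand)
        n = 0
        for char in randStr:
            # (n << 5) - n + charCode, with JS-style 32-bit wraparound
            n = ctypes.c_int(n << 5).value - n + ord(char)
            # JS "n &= n" truncates to a signed 32-bit integer on every step
            n = ctypes.c_int(n & n).value
        # when no data is passed, the JS code shifts the id itself
        if data is None:
            suffix = ctypes.c_int(rand << 1).value
        else:
            suffix = ctypes.c_int(data << 1).value
        # first six digits of |n|, followed by the shifted value
        return f"{str(abs(n))[:6]}{suffix}"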
    
    r = requests.get("https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position",
                     params={
                         "id": number,
                         "status": encode(number, None)
                     },
                     headers={
                         "Origin": "https://www.katastar.hr"
                     })
    print(r.json())
    
    # GET parcel Id 13241901
    parcelId = 13241901
    r = requests.get("https://oss.uredjenazemlja.hr/rest/katHr/parcelInfo",
                     params={
                         "id": number,
                         "status": encode(number, parcelId)
                     },
                     headers={
                         "Origin": "https://www.katastar.hr"
                     })
    print(r.json())
    

    repl.it: https://replit.com/@bertrandmartel/ScrapeKatastarDecode
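
As an offline sanity check of the encoding described above, the sketch below recomputes status for the id/status pair taken from the URL in the question (id=2432593, status=1332094865186). It is a minimal sketch rather than the site's own code: the helper names to_int32 and make_status are made up for this illustration, and bit masking is used in place of ctypes to emulate JavaScript's signed 32-bit arithmetic.

```python
def to_int32(value):
    """Wrap a Python int to a signed 32-bit integer, as JS bitwise operators do."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def make_status(request_id, data=None):
    """Hypothetical helper mirroring the JS create() function shown above."""
    n = 0
    for ch in str(request_id):
        # (n << 5) - n + charCode, truncated to 32 bits on every iteration
        n = to_int32(to_int32(n << 5) - n + ord(ch))
    # with no data, the JS code shifts the id itself
    shifted = to_int32((request_id if data is None else data) << 1)
    # first six digits of |n| followed by the shifted value
    return str(abs(n))[:6] + str(shifted)


# id and status copied from the request URL in the question
sample_id = 2432593
sample_status = "1332094865186"
print(make_status(sample_id) == sample_status)
```

If the comparison prints True, the port agrees with the value the browser sent, which is a cheap way to confirm the algorithm before issuing any requests.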