Search code examples
pythonexcelvbawebscreen-scraping

Get the content of a public website with Python


I am faced with a real mystery:

VBA

 .send "land_abk=sh&ger_name=Norderstedt&order_by=2&ger_id=X1526"

Python

headers = {'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding':'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive','Referer': 'https://url'}

A click on link leads to the last child-ULR and to the details. I tried really everthing to get data from the 3. site, with POST, GET, VBA, PYTHON-Referer, wihtout success. I just get header response 200 & header-content, but not a single letter from the sourcecode, just an error without any description. The only way to open this 3rd page without error and with content is to click the link on the 2nd page. This is a completely public website, no reason to build in referer or any other encryption. So what is the problem and how to fix it?


Solution

  • Your headers should work fine, as long as you include the correct referrer. Maybe there is something wrong in your way of receiving the html. This works for me:

    Using urllib3

    import urllib3
    from bs4 import BeautifulSoup
    
    URL = "https://www.zvg-portal.de/index.php?button=showZvg&zvg_id=755&land_abk=sh"
    headers = {
        "Referer": "https://www.zvg-portal.de/index.php?button=Suchen",
    }
    
    http = urllib3.PoolManager()
    response = http.request("GET", URL, headers=headers)
    html = response.data.decode("ISO-8859-1")
    
    soup = BeautifulSoup(html, "lxml")
    print(soup.select_one("tr td b").text)
    # >> 0061 K 0012/ 2019
    

    Using requests

    import requests
    
    URL = "https://www.zvg-portal.de/index.php?button=showZvg&zvg_id=755&land_abk=sh"
    
    headers = {
        "Referer": "https://www.zvg-portal.de/index.php?button=Suchen",
    }
    html = requests.get(URL, headers=headers).text
    
    print("Versteigerung im Wege der Zwangsvollstreckung" in html)
    # >> True
    

    Using Python 2:

    import urllib2
    
    URL = "https://www.zvg-portal.de/index.php?button=showZvg&zvg_id=755&land_abk=sh"
    
    req = urllib2.Request(URL)
    req.add_header("Referer", "https://www.zvg-portal.de/index.php?button=Suchen")
    html = urllib2.urlopen(req).read()
    
    print("Versteigerung im Wege der Zwangsvollstreckung" in html)
    # >> True