Tags: java, pdf, browser, download

PDF direct download issue


The site vermittlerregister.info allows downloading a PDF file via a regular GET request, e.g. https://www.vermittlerregister.info/recherche?a=pdf&registernummer=D-W-111-BHC1-55

We want to automate this [for mass downloading] with Java, but we've failed.

Failed attempts

Here are some examples of what we've tried:

  1. https://medium.com/@pasanmanohara/download-a-pdf-file-from-a-url-in-the-spring-boot-java-30fa325d6ab9
  2. https://www.baeldung.com/java-download-file#using-java-io (point 2)
  3. ScrapeOps' own requests with browsers.

All of them return the web page rather than a PDF file.

Supposed site/server operation

I've checked, and it turns out the site first checks whether a bot or a real user (browser) is making the request, and only then returns the PDF:

When I try to open a PDF link in a browser (Edge and also Chrome), then
(1) the web page opens first [and there, I assume, it checks the authenticity of the browser];
(2) when I request the same link again (F5), the file indeed gets downloaded. Subsequent requests download PDFs immediately.
Can we try a "double click" or something similar?
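
That "request twice" idea can be tried programmatically: send the same request two times through an HTTP client with a cookie jar, so whatever session cookie the first (HTML) response sets is replayed on the second request, like pressing F5. Below is a minimal sketch with java.net.http (Java 11+); the class name `TwoPassDownload` is made up, and there is no guarantee the site's check is satisfied by cookies alone.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class TwoPassDownload {

    // A client with a cookie jar, so cookies set by the first response
    // are replayed on the second request -- like a browser reload.
    static HttpClient browserLikeClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: TwoPassDownload <registernummer>");
            return; // guarded: no network call unless a register number is given
        }
        HttpClient client = browserLikeClient();
        URI uri = URI.create(
                "https://www.vermittlerregister.info/recherche?a=pdf&registernummer=" + args[0]);
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("User-Agent", "Mozilla/5.0")
                .GET()
                .build();

        // First pass: expected to return the HTML page and set a session cookie.
        client.send(request, HttpResponse.BodyHandlers.discarding());
        // Second pass: the cookie is now in the jar; the PDF should come back.
        HttpResponse<Path> pdf =
                client.send(request, HttpResponse.BodyHandlers.ofFile(Path.of("test.pdf")));
        System.out.println("HTTP " + pdf.statusCode() + " -> test.pdf");
    }
}
```

Whether this works depends on the session cookie being the only thing the site checks, which is what the F5 behaviour suggests.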

Check for antibots on site

A check for anti-bot protection (run on a Discord server) has shown that the site has none.

Update 1

Based on Mihnea's suggestion I've tried the following:

curl 'https://www.vermittlerregister.info/en/search?a=pdf&registernummer=D-W-111-BHC1-55' -H 'Cookie: session=s%3A9Q4kmQAo8-J7r9JgcxA_xBcpTGRW3ZmN.xBcVp19mrdbw%2FW0KZgSZNlj27BakNcA20m%2FjAIXRuic' > test.pdf

resulting in a broken test.pdf

and this error message:

curl: (3) URL using bad/illegal format or missing URL
"registernummer" is not an internal or external command, executable program, or batch file.

(The cause: cmd.exe on Windows does not treat single quotes as quoting, so the unquoted & split the command line; on Windows the URL has to be wrapped in double quotes instead.)

Update 2

It worked out once I included all the cookies.

Eg. request:

curl "https://www.vermittlerregister.info/en/search?a=pdf&registernummer=D-CRQV-N63D6-52" ^
  -H "accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7" ^
  -H ^"cookie: klaro-cookie-consent=^%^7B^%^22econda^%^22^%^3Atrue^%^2C^%^22session^%^22^%^3Atrue^%^2C^%^22cookie-consent^%^22^%^3Atrue^%^7D; emos_jcsid=AY9S7EDyMyIwXqoqRContP58fgh73OXg:1:0:0; emos_jcvid=AY9S7EDyMyIwXqoqRContP58fgh73OXg:1:0:0:0:true:1; session=s^%^3A9Q4kmQAo8-J7r9JgcxA_xBcpTGRW3ZmN.xBcVp19mrdbw^%^2FW0KZgSZNlj27BakNcA20m^%^2FjAIXRuic^" ^
  -H "user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0" > test2.pdf
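
For the Java automation that was the original goal, the working curl request above can be reproduced with java.net.http (Java 11+). This is a minimal sketch under the assumption that replaying the browser's Cookie and User-Agent headers is enough; the class and method names are made up, and the cookie header value is a placeholder you copy from the browser's developer tools.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class PdfDownloader {

    // Builds the same GET request as the curl command, replaying the
    // Cookie and User-Agent headers copied from the browser.
    static HttpRequest buildRequest(String registerNumber, String cookieHeader) {
        URI uri = URI.create(
                "https://www.vermittlerregister.info/en/search?a=pdf&registernummer="
                + registerNumber);
        return HttpRequest.newBuilder(uri)
                .header("Cookie", cookieHeader)
                .header("User-Agent",
                        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                        + "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0")
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: PdfDownloader <cookie-header>");
            return; // guarded: no network call without a real cookie value
        }
        HttpRequest request = buildRequest("D-CRQV-N63D6-52", args[0]);
        HttpResponse<Path> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofFile(Path.of("test2.pdf")));
        System.out.println("HTTP " + response.statusCode() + " -> test2.pdf");
    }
}
```

For mass downloading, the same request can be repeated in a loop over register numbers while the session cookie stays valid.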

Solution

  • First of all, it seems like they do have some anti-bot check when accessing the URL without the ?a=pdf parameter.

    Secondly, I think the reason your requests are not working is that you have to pass the session cookie header alongside the request. Here is a curl example:

    curl 'https://www.vermittlerregister.info/en/search?a=pdf&registernummer=D-W-111-BHC1-55' -H 'Cookie: session=<YOUR_SESSION_COOKIE>;' > test.pdf
    

    You can get the session cookie by navigating to the URL in your browser.