java, proxy, web-scraping, jsoup, http-proxy

Changing proxy while scraping data


I wrote a data-scraping tool in Java using the JSoup library. I post some data to a page and read the results from the submitted page. Everything was working perfectly, but the site was updated recently, and now after 300-500 results the page becomes unavailable to me and stays broken for the next few hours. When I manually change the proxy:

System.setProperty("http.proxyHost", proxy);
System.setProperty("http.proxyPort", proxyPort);

Then my app continues and everything works fine again. The problem is that I have to update the proxy manually every single time I get a read timeout exception.
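For illustration, this is roughly the kind of automatic switching I have in mind: a minimal sketch that assumes a small list of proxies to rotate through and uses Jsoup's per-connection proxy setting (all hosts, ports, and names below are placeholders).

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.List;
import java.util.Map;

public class ProxyRotatingFetcher {

    // Illustrative proxy list; real entries would come from configuration.
    private final List<String[]> proxies = List.of(
            new String[]{"10.0.0.1", "8080"},
            new String[]{"10.0.0.2", "8080"});
    private int current = 0;

    // Submit the form data and return the result page, switching to the next
    // proxy whenever a read timeout occurs.
    Document submit(String url, Map<String, String> formData) throws IOException {
        for (int attempt = 0; attempt < proxies.size(); attempt++) {
            String[] p = proxies.get(current);
            try {
                return Jsoup.connect(url)
                        .proxy(p[0], Integer.parseInt(p[1])) // per-connection proxy instead of global system properties
                        .timeout(10_000)
                        .data(formData)
                        .post();
            } catch (SocketTimeoutException e) {
                // Treat the read timeout as a blocked IP: move to the next proxy and retry.
                current = (current + 1) % proxies.size();
            }
        }
        throw new SocketTimeoutException("All proxies timed out for " + url);
    }
}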

Is there any other way to bypass this IP-blocking filter after 500+ results, or do I have to enter a proxy myself every time my IP gets blocked?


Solution

  • I think the real problem is not how to switch proxies, but rather that you're hitting some limits on the target machine. Keep in mind that some servers are heavily loaded, or need to serve content to other users too, so they establish crawling quotas or other anti-DoS limits to make it harder for one person doing intensive crawling to exhaust local resources. The limits vary from website to website, and this is something you need to determine by experimenting. If the server gives you 2-3 pages/sec, that is not bad at all.

    Check, for instance, the Heritrix crawler. By default it implements rules for "Responsible Crawling", which means the crawler tries to be polite to the remote server. For example, by default it waits 5 seconds before issuing another request to the same server. There is also a delay factor (default 5): if it takes the server 1 second to reply, then we probably shouldn't issue more than one request every 5 seconds.
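    As a rough illustration of that politeness rule (this is not Heritrix code, just a hand-rolled sketch in plain Java using the constants described above), the wait before the next request can be derived from how long the previous one took:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import java.io.IOException;

    public class PoliteFetcher {

        // Assumed constants mirroring the Heritrix-style defaults mentioned above.
        private static final long MIN_DELAY_MS = 5_000; // at least 5 s between requests to the same host
        private static final int DELAY_FACTOR = 5;      // wait 5x the last response time

        private long lastRequestEnd = 0;
        private long lastDurationMs = 0;

        Document fetchPolitely(String url) throws IOException, InterruptedException {
            // Wait for whichever is longer: the fixed minimum delay or delayFactor * last
            // response time, minus the time already spent since the previous request finished.
            long wait = Math.max(MIN_DELAY_MS, DELAY_FACTOR * lastDurationMs)
                    - (System.currentTimeMillis() - lastRequestEnd);
            if (lastRequestEnd > 0 && wait > 0) {
                Thread.sleep(wait);
            }

            long start = System.currentTimeMillis();
            Document doc = Jsoup.connect(url).get();
            lastRequestEnd = System.currentTimeMillis();
            lastDurationMs = lastRequestEnd - start;
            return doc;
        }
    }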

    Coming back to the problem, here is what you need to check:

    • How many queries can you issue to the server in a given amount of time? Once you find out, spread your queries across that time frame so you never exceed the quota (see the sketch after this list).
    • Maybe the limit is bandwidth-based? How about using HTTP/1.1 and gzip compression?
    • If the remote server supports HTTP/1.1, maybe you can use "Connection: keep-alive" and make, for example, 10 or 20 queries over the same HTTP connection.
    • See if you can run your crawler at night; the server may be less busy, and your query queue might be downloaded faster.
    • Be prepared for your crawl to take some time.
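    For the first point, once you know the quota, a simple fixed-interval pacer spreads the queries evenly across the time frame. The numbers below are placeholders for whatever your experiments show, and the actual Jsoup request is left as a comment:

    import java.util.List;
    import java.util.concurrent.TimeUnit;

    public class QuotaPacer {

        public static void main(String[] args) throws InterruptedException {
            // Placeholder quota: assume experiments showed ~300 requests per hour are tolerated.
            int quotaPerHour = 300;
            long intervalMs = TimeUnit.HOURS.toMillis(1) / quotaPerHour; // evenly spaced requests

            List<String> urls = List.of("http://example.com/page1", "http://example.com/page2");
            for (String url : urls) {
                long start = System.currentTimeMillis();
                // ... issue the Jsoup request/post for this url here ...
                long spent = System.currentTimeMillis() - start;
                if (spent < intervalMs) {
                    Thread.sleep(intervalMs - spent); // never exceed the hourly quota
                }
            }
        }
    }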

    In any case, keep in mind that crawling can be very heavy on some servers, which still need resources to serve other visitors. I know this is not exactly an answer to the original problem, but I think it is a different way of solving it :)