java · web-scraping · connection · timeout · jsoup

Why is JSoup timing out at random places in my code?


I am currently trying to use JSoup in Java to scrape retrosheets.org for a baseball coding project I am working on.

I perform multiple JSoup connections in my code, and some of them run inside loops, so they execute many times. In total, the program makes hundreds of connections to scrape the necessary data.

The program runs for about 5 seconds but then hangs on a connection (a different one each time). When that happens, the website also fails to load in my browser. What could be causing this? Is there a problem with making too many connections?

Here is an example of one of the connections (all connections follow the same format):

doc = Jsoup.connect("https://www.retrosheet.org/boxesetc/index.html")
        .maxBodySize(0)
        .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15")
        .get();

This is the error I am getting


Solution

  • This is almost certainly load protection on the target website's side: it detects too many requests coming from the same IP and temporarily blocks that IP, or throttles the number of connections/requests it will accept from it. That is also why the site stops loading in your browser. The problem is not JSoup or Java at all; connections/requests from your IP to the target website are being blocked or throttled. The usual fix is to slow down: add a delay between requests, and back off and retry when a request times out.
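One way to apply that advice is to wrap each scrape in a small helper that pauses before every request and doubles the wait after each failure. This is only a sketch: `PoliteFetcher`, its parameters, and the commented-out Jsoup call are illustrative names, not part of Jsoup's API.

```java
import java.util.function.Supplier;

// Sketch: throttle requests and retry with exponential backoff when the
// target site starts refusing connections. The real request would be the
// Jsoup call shown in the comment below; here it is abstracted as a
// Supplier so the retry logic stands on its own.
public class PoliteFetcher {
    private final long delayMillis; // minimum pause before each request
    private final int maxRetries;   // attempts before giving up

    public PoliteFetcher(long delayMillis, int maxRetries) {
        this.delayMillis = delayMillis;
        this.maxRetries = maxRetries;
    }

    public <T> T fetch(Supplier<T> request) {
        long backoff = delayMillis;
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                Thread.sleep(backoff); // be polite: pause before every attempt
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
            try {
                // In the scraper this would be, e.g.:
                //   () -> Jsoup.connect(url).maxBodySize(0).get()
                // with the checked IOException rethrown as a RuntimeException.
                return request.get();
            } catch (RuntimeException e) {
                last = e;       // remember the failure
                backoff *= 2;   // wait twice as long before the next try
            }
        }
        throw last; // all attempts failed
    }
}
```

With a pause of a second or two between requests, hundreds of pages still finish in a few minutes, and the site is far less likely to start rejecting the connection.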