I was trying to scrape links from Google using 600 different searches. In the process, I started getting the following error.
Error
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=http://ipv4.google.com/sorry/IndexRedirect?continue=http://google.com/search/...
I've done my research: this happens because of the Google Scholar ban, which restricts you to a limited number of searches and then requires you to solve a captcha to proceed, which jsoup can't do.
Code
Document doc = Jsoup.connect("http://google.com/search?q=" + keyWord)
.userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
.timeout(5000)
.get();
Answers on the internet are extremely vague and don't provide a clear solution. Someone mentioned that cookies can solve this issue but didn't say a single thing about how to do it.
Some hints to improve your scraping:
Proxies reduce your chances of getting caught by a captcha. You should use between 50 and 150 proxies, depending on the size of your average result set. Here are two websites that can provide some proxies: SEO-proxies.com or Proxify Switch Proxy.
import java.net.InetSocketAddress;
import java.net.Proxy;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Setup proxy (placeholder address and port)
String proxyAddress = "1.2.3.4";
int proxyPort = 1234;
Proxy proxy = new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(proxyAddress, proxyPort));

// Fetch url with proxy; connect() comes first because it creates the Connection
Document doc = Jsoup //
        .connect(searchUrl) //
        .proxy(proxy) //
        .userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2") //
        .header("Content-Language", "en-US") //
        .get();
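With a pool of 50 to 150 proxies, each search needs some way to pick the next one. A minimal sketch of round-robin rotation follows; `ProxyRotator` is a hypothetical helper (it is not part of jsoup), and the "host:port" strings you pass in are assumed to come from your proxy provider.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: hands out proxies in round-robin order. */
final class ProxyRotator {
    private final List<Proxy> proxies = new ArrayList<>();
    private final AtomicInteger cursor = new AtomicInteger();

    /** Each entry is expected in "host:port" form. */
    ProxyRotator(List<String> hostPorts) {
        if (hostPorts.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        for (String hostPort : hostPorts) {
            int colon = hostPort.lastIndexOf(':');
            String host = hostPort.substring(0, colon);
            int port = Integer.parseInt(hostPort.substring(colon + 1));
            // createUnresolved avoids a DNS lookup when the proxy is registered
            proxies.add(new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port)));
        }
    }

    /** Returns the next proxy, wrapping around at the end of the list. */
    Proxy next() {
        int index = Math.floorMod(cursor.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }
}
```

Each request would then call `rotator.next()` and pass the result to the connection's `proxy(...)` method, so consecutive searches leave through different addresses.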
If you do get caught by a captcha, you can use an online captcha solving service (Bypass Captcha and DeathByCaptcha, to name a few). Below is a generic step-by-step procedure to get the captcha solved automatically:
Detect the captcha error page
try {
    // Perform search here...
} catch (HttpStatusException e) {
    switch (e.getStatusCode()) {
        case java.net.HttpURLConnection.HTTP_UNAVAILABLE: // 503
            // Match only the fixed part of the captcha URL; the "continue" parameter varies per search
            if (e.getUrl().contains("http://ipv4.google.com/sorry/IndexRedirect")) {
                // Ask online captcha service for help...
            } else {
                // ...
            }
            break;
        default:
            // ...
    }
}
Download the captcha image
byte[] captchaImage = Jsoup //
        .connect(imageCaptchaUrl) //
        //.cookie(..., ...) // Some cookies may be needed...
        .ignoreContentType(true) // Needed for fetching image
        .execute() //
        .bodyAsBytes(); // byte[] array returned...
Get the captcha solved
This part depends on the captcha service API. You can find some services in this 8 best captcha solving services article.
Fill the form with response and send it with Jsoup
The Jsoup FormElement is a life saver here. See this working sample code for details.
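To show how the steps of this procedure fit together, here is a rough sketch of the glue code. Everything in it is hypothetical: `CaptchaFlow` and its method names are made up for illustration, and the three `Function` parameters stand in for the jsoup image download, the captcha service call, and the form submission, none of which are implemented here.

```java
import java.net.HttpURLConnection;
import java.util.function.Function;

/** Hypothetical glue code; not part of jsoup or of any captcha service API. */
final class CaptchaFlow {

    /** Detection step: a 503 whose URL points at the "sorry" redirect is treated as a captcha page. */
    static boolean isCaptchaPage(int statusCode, String url) {
        return statusCode == HttpURLConnection.HTTP_UNAVAILABLE
                && url != null
                && url.contains("/sorry/IndexRedirect");
    }

    /**
     * Remaining steps: download the image, have it solved, submit the answer.
     * Each step is passed in as a function so that the HTTP client and the
     * solving service can be swapped without touching this method.
     */
    static String solveAndSubmit(String captchaImageUrl,
                                 Function<String, byte[]> downloadImage,
                                 Function<byte[], String> solveCaptcha,
                                 Function<String, String> submitAnswer) {
        byte[] image = downloadImage.apply(captchaImageUrl);
        String answer = solveCaptcha.apply(image);
        return submitAnswer.apply(answer);
    }
}
```

In the `catch` block, `isCaptchaPage(e.getStatusCode(), e.getUrl())` would decide whether to call `solveAndSubmit`. Keeping the service behind a plain function means the search code does not depend on one particular provider.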
The Hints for Google scrapers article can give you some more pointers for improving your code. You'll find the first two hints presented here plus some more:
Add &num=100 to your URL to send fewer requests.
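As a small illustration of the &num=100 hint, the search URL could be built in one place. `SearchUrls` is a hypothetical helper; it also URL-encodes the keyword, which is an addition of mine rather than something the article prescribes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that builds the search URL used in the question. */
final class SearchUrls {

    /** Encodes the keyword and appends the requested number of results per page. */
    static String build(String keyword, int resultsPerPage) {
        try {
            return "http://google.com/search?q="
                    + URLEncoder.encode(keyword, "UTF-8")
                    + "&num=" + resultsPerPage;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `SearchUrls.build("jsoup captcha", 100)` yields `http://google.com/search?q=jsoup+captcha&num=100`, which can be passed straight to `Jsoup.connect(...)`.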