Tags: java, user-agent, urlconnection

java SocketTimeoutException


I am trying to read the title of the page at https://www.groupon.pl/deals/ga-hotel-alpin-17 (the problem is specific to this particular site):

String address = "https://www.groupon.pl/deals/ga-hotel-alpin-17";
URL url = new URL(address);
URLConnection httpcon = url.openConnection();
httpcon.setConnectTimeout(5000);
httpcon.setReadTimeout(5000);
httpcon.addRequestProperty("User-Agent", "Mozilla/4.0");
InputStream response = httpcon.getInputStream();
Scanner scanner = new Scanner(response);
String responseBody = scanner.useDelimiter("\\A").next();
String title = responseBody.substring(responseBody.toUpperCase().indexOf("<TITLE>") + 7, responseBody.toUpperCase().indexOf("</TITLE>"));

I get either an HTTP 403 response or a SocketTimeoutException:

java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    at sun.security.ssl.InputRecord.read(InputRecord.java:503)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)

Fetching the same page with a simple wget command works without any problem.

I suspect that the server somehow refuses requests coming from Java, but then why doesn't setting the User-Agent help? Is there anything more I can do to imitate a real browser? Any ideas?


Solution

  • I found the answer; the program below works. The key was to look at exactly which request headers the browser sends and to send the same ones. In this case the "Accept-Encoding" header was missing.

    import java.io.*;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.Scanner;
    
    public class Program {
    
    
        public static void main(String[] args) throws IOException {
            System.out.println("Hello World!");
    
            String address = "https://www.groupon.pl/deals/ga-hotel-alpin-17";
            URL url = new URL(address);
            URLConnection httpcon = url.openConnection();
            httpcon.setConnectTimeout(5000);
            httpcon.setReadTimeout(5000);
    
    //        httpcon.addRequestProperty("Host", "www.groupon.pl");
            httpcon.addRequestProperty("User-Agent", "Mozilla/5.0 (X11; Fedora; Lin… Gecko/20100101 Firefox/54.0");
    //        httpcon.addRequestProperty("Accept", "text/html,application/xhtml+x…lication/xml;q=0.9,*/*;q=0.8");
    //        httpcon.addRequestProperty("Accept-Language", "en-US,en;q=0.5");
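            // Note: "utf-8" is a charset name, not a registered content coding; it is kept because this is the value that worked.
            httpcon.addRequestProperty("Accept-Encoding", "utf-8");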
    //        httpcon.addRequestProperty("DNT", "1");
    //        httpcon.addRequestProperty("Connection", "keep-alive");
    //        httpcon.addRequestProperty("Upgrade-Insecure-Requests", "1");
    
            InputStream response = httpcon.getInputStream();
    
    
    
            Scanner scanner = new Scanner(response);
            String responseBody = scanner.useDelimiter("\\A").next();
            String title = responseBody.substring(responseBody.toUpperCase().indexOf("<TITLE>") + 7, responseBody.toUpperCase().indexOf("</TITLE>"));
    
            System.out.println("End!" + title);
        }
    }
    

    The commented-out headers are not necessary.

    Cheers! Lukasz

    PS. sad that I get -2 for this question...
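
The approach above (copy the headers a browser sends, then pull the title out of the body) can be sketched as a small helper. This is only a sketch under a few assumptions: the class name `TitleFetcher` and its method names are made up for illustration, the body is assumed to be UTF-8 text, and the header values are whatever you copy from your own browser session. The title is matched with a case-insensitive regular expression instead of `substring`, so a page without a `<title>` element yields `null` rather than a `StringIndexOutOfBoundsException`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the original program.
final class TitleFetcher {

    // Case-insensitive; tolerates attributes on the tag and line breaks inside the title.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Copies every header from the map onto the not-yet-connected connection.
    static void applyHeaders(URLConnection connection, Map<String, String> headers) {
        for (Map.Entry<String, String> header : headers.entrySet()) {
            connection.setRequestProperty(header.getKey(), header.getValue());
        }
    }

    // Returns the trimmed text of the first <title> element, or null if there is none.
    static String extractTitle(String html) {
        Matcher matcher = TITLE.matcher(html);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    // Fetches the page with the given request headers and returns its title (or null).
    static String fetchTitle(String address, Map<String, String> headers) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        connection.setConnectTimeout(5000);
        connection.setReadTimeout(5000);
        applyHeaders(connection, headers);

        // Assumption: the response body is UTF-8 encoded and not compressed.
        try (InputStream in = connection.getInputStream();
             Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name())) {
            scanner.useDelimiter("\\A");
            String body = scanner.hasNext() ? scanner.next() : "";
            return extractTitle(body);
        }
    }
}
```

To use it, build a `Map<String, String>` containing the "User-Agent" and "Accept-Encoding" values from the program above and pass it to `TitleFetcher.fetchTitle(address, headers)`. The standard token for "no compression" is `identity`; whether this particular server accepts it instead of `utf-8` is something I have not checked.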