Tags: java, multithreading, jsoup, runnable, callable

Java jsoup using threads not working


I have a set of pages like these:

www.foo1.bar
www.foo2.bar
www.foo3.bar
.
.
www.foo100.bar

I am using the jsoup library and connecting to all of the pages at the same time, each in its own Thread:

Thread matchThread = new Thread(task);
matchThread.start();

Each task connects to its page like this and parses the HTML:

Jsoup.connect("www.fooX.bar").timeout(0).get();

I am getting tons of these exceptions:

java.net.ConnectException: Connection timed out: connect
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:529)
at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:388)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:523)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:227)
at sun.net.www.http.HttpClient.New(HttpClient.java:300)
at sun.net.www.http.HttpClient.New(HttpClient.java:317)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:404)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:391)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:157)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:146)

Does jsoup allow only one thread at a time, or am I doing something wrong? Any suggestions on how to connect to my pages faster? Going through them one by one takes ages.

EDIT:

All 700 threads use this method, so maybe it is the problem. Can this method handle that many threads, or does it behave like a singleton?

private static Document connect(String url) {
    Document doc = null;
    try {
        doc = Jsoup.connect(url).timeout(0).get();
    } catch (IOException e) {
        System.out.println(url);
    }
    return doc; 
}

EDIT: the whole thread code

public class MatchWorker implements Callable<Match>{

private Element element;

public MatchWorker(Element element) {
    this.element = element;
}

@Override
public Match call() throws Exception {
    Match match = null;
    Util.connectAndDoStuff();
    return match;
}

}

Starting one thread for each of my 700 elements:

    Collection<Match> matches = new ArrayList<Match>();
    Collection<Future<Match>> results = new ArrayList<Future<Match>>();

    for (Element element : elements) {
        MatchWorker matchWorker = new MatchWorker(element);
        FutureTask<Match> task = new FutureTask<Match>(matchWorker);
        results.add(task);

        Thread matchThread = new Thread(task);
        matchThread.start();
    }
    for(Future<Match> match : results) {
        try {
            matches.add(match.get());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

Solution

  • I tried this:

        ExecutorService executorService = Executors.newFixedThreadPool(5);
        List<Future<Void>> handles = new ArrayList<Future<Void>>();
        Future<Void> handle;
        for (int i=0;i < 12; i++) {
            handle = executorService.submit(new Callable<Void>() {
    
                public Void call() throws Exception {
                    Document d = Jsoup.connect("http://www.google.hr").timeout(0).get();
                    System.out.println(d.title());
                    return null;
                }
            });
            handles.add(handle);
        }
    
        for (Future<Void> h : handles) {
            try {
                h.get();
            } 
            catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    
        executorService.shutdownNow();
    

    It finishes almost immediately and prints the correct title. Perhaps you have a firewall issue? ("Connection timed out" means that the server could not be reached at all.)

    EDIT:

    I used jsoup 1.7.1.

    EDIT^2:

    AFAIK, this should prove that there is no problem with combining jsoup and threads, because the executor is running the requests on threads in the end.

    EDIT^3:

    Also, if you are behind a proxy, here is how you can set proxy settings.
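One common way to route requests through a proxy is the JDK's standard networking system properties. The stack trace in the question shows jsoup going through `HttpURLConnection`, which reads these properties. This is only a sketch: the class name `ProxySettings` and the host and port in the usage line are placeholders, and it is not necessarily the approach the linked page described.

```java
// Hypothetical helper: sets the standard JDK proxy properties for HTTP and HTTPS.
final class ProxySettings {

    static void useHttpProxy(String host, int port) {
        String portText = Integer.toString(port);
        // Read by java.net.HttpURLConnection when it opens plain HTTP connections.
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", portText);
        // The same settings for HTTPS connections.
        System.setProperty("https.proxyHost", host);
        System.setProperty("https.proxyPort", portText);
    }
}
```

Calling `ProxySettings.useHttpProxy("proxy.example.com", 8080)` once, before the first request, should be enough; the host and port here are made up.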

    EDIT^4:

    public static Document connect(String url) {
        Document doc = null;
        try {
            doc = Jsoup.connect(url).timeout(0).get();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
        return doc;
    }
    

    and the call function rewritten:

    public Void call() throws Exception {                    
        System.out.println(App.connect("http://www.google.hr").title());
        return null;
    }
    

    gives the same result. The only thing I can think of is some implicit static synchronization, but that doesn't make much sense since you are getting a connection timeout exception :/ Please post your thread code.

    EDIT:

    I have to go away for a couple of hours. Here are all three of my classes, rewritten.

    They still work; slower, but they work. I would definitely recommend using a fixed thread pool to improve performance.

    But I think it must be a network issue. Good luck :)
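A minimal sketch of the fixed-thread-pool idea applied to a whole list of URLs, using only the JDK. The class `BoundedFetcher` and its `fetch` parameter are hypothetical: `fetch` stands in for the real download (for example a call to the `connect(url)` method shown earlier), so the pool logic can be tried without touching the network.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: runs one task per URL, but never more than poolSize at once.
final class BoundedFetcher {

    static <T> List<T> fetchAll(List<String> urls, int poolSize, Function<String, T> fetch) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<T>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> fetch.apply(url));
            }
            List<T> results = new ArrayList<>();
            // invokeAll blocks until all tasks are done and returns futures in input order.
            for (Future<T> future : pool.invokeAll(tasks)) {
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    // The task threw: keep the position and record the failure as null.
                    results.add(null);
                }
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for downloads", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With 700 URLs and `poolSize` set to 5, at most five requests are in flight at any moment, instead of one thread (and one connection attempt) per page.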

    EDIT:

    Connection timed out means that the destination server could not be reached at all. AFAIK, this means that the TCP SYN+ACK message was either never sent by the server or never received by the client.

    The first thing one could conclude is that the destination server is not online, but there are other possible causes. One is that the destination server is overloaded with requests (in the extreme case this is a (D)DoS attack).
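The `connect` method in the question prints only the URL when something fails, which hides what kind of failure it was. Below is a small sketch of telling the common cases apart by exception type. The class name `FailureKind` and the labels it returns are made up for illustration; the exception classes themselves are standard JDK ones.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

// Hypothetical helper: maps an IOException to a coarse label for logging.
final class FailureKind {

    static String describe(IOException e) {
        if (e instanceof UnknownHostException) {
            return "DNS_LOOKUP_FAILED";   // the host name could not be resolved
        }
        if (e instanceof SocketTimeoutException) {
            return "TIMEOUT_EXPIRED";     // a timeout set by the caller ran out
        }
        if (e instanceof ConnectException) {
            return "CONNECT_FAILED";      // refused, or timed out at the OS level as in the trace above
        }
        return "OTHER_IO_ERROR";
    }
}
```

Logging `FailureKind.describe(e)` next to the URL in the `catch` block would show whether every failure is the same connect timeout or a mix of problems.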

    So far you have tried the parallel approach in two forms:

    1) Making 700 requests in 700 threads (well, not actually 700 at once, but as many as your OS can handle)

    2) Making 700 requests through a thread pool of n << 700 threads

    First, you could try putting a random sleep of 0 to 10 s before each request:

    Thread.sleep(new Random().nextInt(10000));
    

    But given the results so far, that probably won't work. The next thing to try is replacing the parallel approach with the sequential one you mentioned in the comments: every request runs one after another inside the for loop of a single main thread. You could add the random sleep there as well.

    That's the gentlest (and slowest) way you can go, and if that doesn't work I don't know how to solve it :(
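The sequential variant with a random pause could look roughly like this. As before, this is a sketch under assumptions: `PoliteSequentialFetcher`, `maxDelayMillis` and the `fetch` function are hypothetical names, and `fetch` stands in for the actual jsoup call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

// Hypothetical helper: downloads pages one at a time with a random pause before each.
final class PoliteSequentialFetcher {

    static <T> List<T> fetchSequentially(List<String> urls, int maxDelayMillis,
                                         Function<String, T> fetch) {
        List<T> results = new ArrayList<>();
        for (String url : urls) {
            try {
                // Sleep between 0 and maxDelayMillis milliseconds (inclusive).
                Thread.sleep(ThreadLocalRandom.current().nextInt(maxDelayMillis + 1));
            } catch (InterruptedException e) {
                // Stop early but keep the interrupt flag set for the caller.
                Thread.currentThread().interrupt();
                break;
            }
            results.add(fetch.apply(url));
        }
        return results;
    }
}
```

Passing `10000` as `maxDelayMillis` matches the 0 to 10 s range suggested above. With 700 pages that averages out to roughly an hour of sleeping alone, which is the price of being this gentle.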

    EDIT:

    Using a thread pool of 5 threads, I downloaded the titles of 1141 soccer matches successfully.

    It is logical for a site of that kind to protect its data. Most likely, while you were developing and testing (repeatedly running with as many threads as you could summon), their system identified you (or rather your IP) as a crawler that wants all their data. They obviously don't like or want that, so they banned you. They decided not even to refuse your requests but to play dead, hence the connection timeout. That makes sense now. Phew :)

    If this is correct, you should be able to get the data through a proxy, but be polite and use a thread pool of < 10 threads :)