java multithreading jsoup

Finding links from the website using threads


I'm working on a program that collects all the links from a website and checks each one for an input word. It then follows each of those links and searches again, and so on. The program goes three levels deep (that is why n is compared with 3). The code below does this recursively and seems to work fine.

However, I would like to speed this up by using threads. How can I implement this? From what I have heard, I can probably use fork/join for that.

public static void getLinks(String url, Set<String> urls, String word, int n) {
    if(url.contains(word)) {
        System.out.println("Found: " + url);
    }

    if (urls.contains(url)) {
        return;
    }
    urls.add(url);

    if(n<3) {
        try {
            Document doc = Jsoup.connect(url).get();
            Elements elements = doc.select("a[href]");
            for (Element element : elements) {
                System.out.println(element.absUrl("href"));
                getLinks(element.absUrl("href"), urls, word, n + 1);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

public static void main(String[] args) {
    Set<String> links = new HashSet<>();
    String word = "root";
    getLinks("https://example.com", links, word, 0);
}

PS: in the final version of the program, the links that match the input word will be shown in a GUI.


Solution

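  • The simple way is to submit getLinks to a thread pool while iterating through Elements. The visited set then has to be thread-safe, and the main thread needs some way to know when the last task has finished. The sketch below uses a concurrent set and a counter of pending tasks for that; the pool size of 8 is an arbitrary choice:

        import java.io.IOException;
        import java.util.Set;
        import java.util.concurrent.*;
        import java.util.concurrent.atomic.AtomicInteger;
        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;
        import org.jsoup.nodes.Element;

        static final ExecutorService executorService = Executors.newFixedThreadPool(8);
        static final Set<String> urls = ConcurrentHashMap.newKeySet(); // thread-safe visited set
        static final AtomicInteger pending = new AtomicInteger();      // tasks submitted but not yet finished
        static final CountDownLatch done = new CountDownLatch(1);

        public static void main(String[] args) throws InterruptedException {
            submit("https://example.com", "root", 0);
            // Wait until all tasks are complete
            // Or use done.await(timeout, unit) if you want to have a maximum wait time
            done.await();
            executorService.shutdown();
        }

        // Count the task before queueing it, so pending cannot reach 0 while work remains
        static void submit(String url, String word, int n) {
            pending.incrementAndGet();
            executorService.execute(() -> {
                try {
                    getLinks(url, word, n);
                } finally {
                    if (pending.decrementAndGet() == 0) done.countDown();
                }
            });
        }

        public static void getLinks(String url, String word, int n) {
            if (!urls.add(url)) return; // add() returns false if the URL was already visited
            if (url.contains(word)) System.out.println("Found: " + url);
            if (n >= 3) return;
            try {
                Document doc = Jsoup.connect(url).get();
                for (Element element : doc.select("a[href]")) {
                    submit(element.absUrl("href"), word, n + 1); // one task per link
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }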
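
Since the question mentions fork/join, here is a minimal sketch of the same depth-limited traversal written as a RecursiveAction. It is one possible shape, not a definitive implementation. The class name ForkJoinCrawler and the fetchLinks parameter are made up for illustration: fetchLinks is a hook that returns the outgoing links of a URL, so the example runs against an in-memory link map (Java 9+ for Map.of and List.of). In the real program the hook would presumably wrap Jsoup.connect(url).get() and doc.select("a[href]"). Matches are collected into a set instead of being printed, which may be easier to hand to a GUI.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.function.Function;

// Hypothetical class: a depth-limited crawl expressed as fork/join tasks.
final class ForkJoinCrawler {
    private final String word;
    private final int maxDepth;
    // Assumed hook: returns the links found on a page (e.g. via jsoup in the real program).
    private final Function<String, List<String>> fetchLinks;
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final Set<String> matches = ConcurrentHashMap.newKeySet();

    ForkJoinCrawler(String word, int maxDepth, Function<String, List<String>> fetchLinks) {
        this.word = word;
        this.maxDepth = maxDepth;
        this.fetchLinks = fetchLinks;
    }

    // Runs the crawl and blocks until every subtask has finished.
    Set<String> crawl(String startUrl) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            pool.invoke(new CrawlTask(startUrl, 0));
        } finally {
            pool.shutdown();
        }
        return matches;
    }

    private final class CrawlTask extends RecursiveAction {
        private final String url;
        private final int depth;

        CrawlTask(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }

        @Override
        protected void compute() {
            if (!visited.add(url)) return;            // already handled by another task
            if (url.contains(word)) matches.add(url); // record instead of printing
            if (depth >= maxDepth) return;            // same cut-off as n < 3 in the question
            List<CrawlTask> children = new ArrayList<>();
            for (String link : fetchLinks.apply(url)) {
                children.add(new CrawlTask(link, depth + 1));
            }
            invokeAll(children);                      // fork all children and wait for them
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for a website: page -> links on that page
        Map<String, List<String>> site = Map.of(
                "http://a", List.of("http://a/root", "http://b"),
                "http://b", List.of("http://a", "http://b/root/x"));
        ForkJoinCrawler crawler =
                new ForkJoinCrawler("root", 3, url -> site.getOrDefault(url, List.of()));
        System.out.println(new TreeSet<>(crawler.crawl("http://a")));
        // prints [http://a/root, http://b/root/x]
    }
}
```

One caveat: fork/join pools are designed mainly for CPU-bound work, and page downloads are blocking I/O, so the default parallelism (one worker per core) may leave the network underused. Passing a larger parallelism to the ForkJoinPool constructor, or staying with a plain ExecutorService, is probably the more practical choice for a crawler.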