Tags: java, multithreading, concurrency, java-threads

Use of multithreading for downloading in Java


I'm trying to concurrently download the HTML of websites whose URLs are stored in a database (about 3 million entries).
It's obvious that I should use multithreading, but I'm having trouble working out how to do it in Java.

Here's how I used to do it without multithreading:

final Connection c = dbConnect(); // register the JDBC driver and establish a connection
checkRequiredDbAndTables();       // check that the database and the necessary tables exist

// get the list of URLs from the database
String sql = "select id, website_url, category_id from list_of_websites";
try (PreparedStatement ps = c.prepareStatement(sql);
     ResultSet rs = ps.executeQuery()) {

    while (rs.next()) {
        // ResultSet columns are numbered from 1
        final long id = rs.getLong(1);      // website id
        final String url = rs.getString(2); // website URL

        System.out.println("Category: " + rs.getString(3) + " " + id + " " + url);

        // check the URL syntax and that the site is reachable
        if (isValidURL(url) && connectionOK(url)) {
            String htmlInPage = downloadHTML(url);
            if (!htmlInPage.isEmpty()) {
                // add the result to the database
                insertDataToDb(c, id, htmlInPage);
            }
        }
    }
} catch (SQLException e) {
    e.printStackTrace();
} finally {
    closeConnection(c); // close the database connection, even if an error occurred
}

The function downloadHTML uses the jsoup library to do the main work.

It feels like my task is a kind of "producer-consumer problem". I suppose it can be represented this way: there is a buffer containing N links; several workers take links from it and download the HTML; and one process loads new URLs from the database into the buffer as it empties.
But I have no idea how to implement it. I've heard of Threads and of ExecutorService, which provides thread pools, but it's really confusing to me.


Solution

  • You may want to use a thread pool with a fixed number of threads. Your program first creates the thread pool and then reads the URLs from the database. Each time a URL is read, the program submits a new task to the pool to download its content.

    Your program can also maintain a queue. When a task finishes downloading the HTML, it pushes the URL and the result together onto the queue. Once the main thread has finished reading URLs and submitting tasks, it waits on the queue. Whenever the queue has a response, the main thread takes it out and writes it to the database. The main thread also counts the responses it has received; when the count reaches the number of URLs, all tasks have finished.

    You can write a small class that stores the response together with its URL, for example:

    // Holds a URL together with the HTML downloaded from it.
    class Response {
        public final String url;
        public final String result;
        public Response(String url, String result) { this.url = url; this.result = result; }
    }
    

    If you still have any problems implementing or understanding this (I may not have explained it clearly enough; it is 00:40 here and I will probably go to sleep soon), please leave a comment. If you want code, please leave a comment as well.
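For reference, the approach described above could be sketched roughly as follows. This is only a minimal illustration, not a finished implementation: the class name `DownloadPool`, the method `downloadAll`, and the `downloader` parameter are hypothetical names introduced here, and `downloader` merely stands in for the asker's `downloadHTML(url)`. Worker tasks push a URL/result holder onto a `LinkedBlockingQueue`, and the calling thread takes exactly one response per URL, which is where the database insert would go.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

public class DownloadPool {

    /** Holds a URL together with the HTML downloaded from it (empty on failure). */
    public static final class Response {
        public final String url;
        public final String result;
        public Response(String url, String result) { this.url = url; this.result = result; }
    }

    /**
     * Runs one download task per URL on a fixed-size thread pool and returns
     * the responses in the order they complete (not the input order).
     * "downloader" is a hypothetical stand-in for downloadHTML(url).
     */
    public static List<Response> downloadAll(List<String> urls, int threads,
                                             Function<String, String> downloader) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        BlockingQueue<Response> queue = new LinkedBlockingQueue<>();
        List<Response> responses = new ArrayList<>();
        try {
            for (String url : urls) {
                pool.execute(() -> {
                    String html;
                    try {
                        html = downloader.apply(url);
                    } catch (RuntimeException e) {
                        html = ""; // still enqueue a response so the count below is reached
                    }
                    queue.add(new Response(url, html));
                });
            }
            // One response is expected per URL; take() blocks until one is available.
            for (int i = 0; i < urls.size(); i++) {
                Response r = queue.take();
                responses.add(r); // a real program would write r to the database here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag, return what we have
        } finally {
            pool.shutdown(); // let the worker threads exit once their tasks are done
        }
        return responses;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://a.example", "http://b.example", "http://c.example");
        // Fake downloader for demonstration; replace with the real download call.
        List<Response> responses = downloadAll(urls, 2, url -> "<html>" + url + "</html>");
        for (Response r : responses) {
            System.out.println(r.url + " -> " + r.result.length() + " chars");
        }
    }
}
```

Because only the calling thread reads from the queue, the single JDBC connection would be used by one thread only. With millions of URLs it would probably be wise to submit them in batches rather than all at once, since both the pool's task queue and the response queue in this sketch are unbounded.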