Tags: java, database, multithreading, distributed, grid-computing

How to create a server that distributes workunits to clients?


I need a Java application that manages a database in order to distribute work units to its clients. Effectively it's a grid application: the database is filled with input parameters for the clients, and all its tuples must be distributed to the clients that request them. Afterwards, the clients send back their results and the server modifies the database accordingly (for example, marking the tuples as computed).
Now let's suppose that I have a database (SQLite or MySQL) filled with tuples and that clients request a group of input tuples. I want each group of work units to be sent exclusively to a single client, so I need to mark them as "already requested by another client". If I query the db for the first (for example, 5) tuples and meanwhile another client makes the same request (in a multi-threaded server architecture without any synchronization), I think there is a possibility that both clients receive the same work units.

I imagined that solutions could be:
1) make a single-threaded server architecture (ServerSocket.accept() is called again only after the previous client request has been served, so that the server is effectively accessed by only one client at a time)
2) in a multi-threaded architecture, make the query and tuples-lock operations synchronized, so that I obtain a kind of atomicity (effectively serializing operations over the database)
3) use atomic query operations on the database server (or file, in the case of SQLite), but in this case I need help because I don't know how this really works...

I hope my problem is clear: it's very similar to SETI@home, which distributes its work units in such a way that the intersection of the units handed to its multitude of clients is empty (theoretically). My non-functional requirements are that the language is Java and that the database is SQLite or MySQL.


Solution

  • Some feedback for each of your potential solutions ...

    1) make a single-threaded server architecture (ServerSocket.accept() is called again only after the previous client request has been served, so that the server is effectively accessed by only one client at a time)

    ServerSocket.accept() will not allow you to do that; you would need some other type of synchronization so that only one thread at a time is fetching tuples. This basically leads you to your solution (2).

    2) in a multi-threaded architecture, make the query and tuples-lock operations synchronized, so that I obtain a kind of atomicity (effectively serializing operations over the database)

    Feasible, easy to implement, and a common way to approach the problem. The only issue is how much you care about performance, latency, and throughput: if you have many clients and the work units are very short-lived, the clients might end up spending 90% of their time blocked, waiting to get the "token".
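    As a rough illustration of solution 2, here is a minimal sketch. It assumes an in-memory queue of integer IDs as a stand-in for the table of unclaimed tuples; the class name `SynchronizedDispatcher` and the method `claimBatch` are hypothetical. In a real server the body of the synchronized method would run the "select a batch" and "mark as requested" statements against the database.

    ```java
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    // Hypothetical in-memory stand-in for the table of unclaimed tuples.
    final class SynchronizedDispatcher {
        private final Deque<Integer> unclaimed = new ArrayDeque<>();

        SynchronizedDispatcher(int totalUnits) {
            for (int id = 0; id < totalUnits; id++) {
                unclaimed.add(id);
            }
        }

        // Only one handler thread at a time can select and remove a batch,
        // so no two clients can be handed the same work unit.
        synchronized List<Integer> claimBatch(int batchSize) {
            List<Integer> batch = new ArrayList<>();
            while (batch.size() < batchSize && !unclaimed.isEmpty()) {
                batch.add(unclaimed.poll());
            }
            return batch;
        }
    }
    ```

    Each client-handler thread would call `claimBatch(5)` on one shared dispatcher instance. The lock is held only while a batch is being picked, which is where the waiting described above would show up under heavy load.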

    A possible solution to that issue: use a hash-based distribution for work units. Let's say you have 500 work units to be shared between 50 clients. You give IDs to your work units in such a way that you know in advance which clients will get which work units. In the end, you can assign nodes with a simple modulo operation:

    assigned_node_id = work_unit_id % number_of_working_nodes

    This technique, called pre-allocation, doesn't work for all types of problems, so it depends on your application. Use this approach if you have many short-running processes.
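    The modulo rule can be sketched in Java as follows. This assumes work unit IDs are the contiguous integers 0 to 499 and node IDs are 0 to 49; the class and method names are hypothetical.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    final class PreAllocation {
        // assigned_node_id = work_unit_id % number_of_working_nodes
        static int assignedNode(int workUnitId, int numberOfWorkingNodes) {
            return Math.floorMod(workUnitId, numberOfWorkingNodes);
        }

        // All work units that belong to one node, computed without any locking.
        static List<Integer> unitsFor(int nodeId, int totalUnits, int numberOfWorkingNodes) {
            List<Integer> units = new ArrayList<>();
            for (int id = 0; id < totalUnits; id++) {
                if (assignedNode(id, numberOfWorkingNodes) == nodeId) {
                    units.add(id);
                }
            }
            return units;
        }
    }
    ```

    With 500 units and 50 nodes, `PreAllocation.unitsFor(7, 500, 50)` yields the ten units 7, 57, 107, and so on up to 457. Because the assignment is a pure function of the ID, no client ever has to wait for another one.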

    3) use atomic query operations on the database server (or file, in the case of SQLite), but in this case I need help because I don't know how this really works...

    It's in essence the same as (2), but if you are able to do this, which I doubt you can with just SQL, you would end up tied to some specific features of your RDBMS. Most likely you would have to use some non-standard SQL procedures to achieve this solution. And it doesn't fix the issues you would find with solution 2.
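    To make the RDBMS dependence concrete, here is a sketch of what such a statement might look like over JDBC. It assumes MySQL, which accepts ORDER BY and LIMIT on a single-table UPDATE (this is not standard SQL), and a hypothetical table `work_units(id, owner)` in which `owner` is NULL while a unit is unclaimed. The class name `AtomicClaim` is also hypothetical.

    ```java
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    final class AtomicClaim {
        // Hypothetical schema: work_units(id, owner), owner IS NULL while unclaimed.
        // ORDER BY / LIMIT on UPDATE is a MySQL extension, not standard SQL.
        static final String CLAIM_SQL =
            "UPDATE work_units SET owner = ? WHERE owner IS NULL ORDER BY id LIMIT ?";

        // Marks up to batchSize unclaimed units as owned by clientId in one
        // statement and returns how many rows were actually claimed.
        static int claim(Connection connection, String clientId, int batchSize)
                throws SQLException {
            try (PreparedStatement statement = connection.prepareStatement(CLAIM_SQL)) {
                statement.setString(1, clientId);
                statement.setInt(2, batchSize);
                return statement.executeUpdate();
            }
        }
    }
    ```

    The client's batch could then be read back with a plain `SELECT ... WHERE owner = ?`. The marking happens in a single statement, so the exclusivity relies on the database's own row locking rather than on Java-level synchronization, and the statement would need to be adapted for another RDBMS.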

    Summary

    Solution 2 is likely to work in 90% of the cases; the longer the tasks are, the better this solution fits. If the tasks are very short, definitely go for a pre-allocation based algorithm.

    With solution 3 you give up portability and flexibility.

    DRY: try some other Open Source systems ...

    There are a few open source Java projects that already deal with this kind of issue. They might be overkill for you, but I think they're worth mentioning ...

    http://www.gridgain.com/

    http://www.jppf.org/