
What architecture should I use for a website with CPU-intensive server-side tasks?


I am working on a hobby project website. Each user has up to several hundred megabytes of data specific to them stored in my database. The user can run various types of statistical analysis on the data that will result in graphs for the user to see the results. The user will do all this from the browser.

My question is: how do I set up the server side? It needs to support at least a few thousand concurrent users. Each user is expected to run a few queries on their data set per session. Clearly I can't just have a single web server.

What I am thinking so far is to have a web server take in requests; a script on the web server then sends each request to a cluster of several machines that do the number crunching. The cluster contains a master and several workers. All requests come to the master. The master monitors the workers and sends each request to the best available worker. The worker crunches the numbers and sends a response back to the web server. The web server then sends the data to the user's browser, where the graphs are built.

Does that idea work? If so, how would I create a connection to the master? What would its contact info be? Is there good load balancing software out there so that I wouldn't have to develop the master?

Also, how do companies do things similar to this, or rather what is the best way to solve this problem? I tried to look it up, but could not find any specifics. Thanks in advance.


Solution

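  • This is traditionally done with a message-queue model (often loosely called pub/sub; strictly it is a work queue, since each job is handled by exactly one worker). The exact implementation depends on what language/platform you are using, but the basic design is as follows: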

    1. The client creates a "query object" describing the job, writes it to a database, and drops a message on a message queue.
    2. The client then polls the database for the results. (This applies when the application is web-based; otherwise a response queue is often set up instead.)
    3. The workers poll the queue for work whenever they go idle. When a worker finds a message, it picks it up, runs the job, and writes the results back to the database for the client to pick up.

    There are variations on this theme. For example, when the request and response are small enough to fit in the queue message itself, you can eliminate the intermediate database. This gets ugly in a web-based polling model, though, because you need to get the right response back to the right HTTP response thread.
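
The three steps above can be sketched in a single Python process. This is only an illustration under loud assumptions: a dict stands in for the database, `queue.Queue` stands in for the message queue, and threads stand in for the worker machines. The names `submit_job`, `poll_result`, `worker`, and `run_demo` are hypothetical, and the "analysis" is just a mean and standard deviation. A real deployment would more likely use a message broker and separate worker processes or hosts.

```python
import queue
import statistics
import threading
import time
import uuid

# Stand-ins (assumptions for this sketch): a dict plays the database,
# queue.Queue plays the message queue, threads play the worker machines.
results_db = {}
db_lock = threading.Lock()
job_queue = queue.Queue()


def submit_job(data):
    """Step 1: record the query in the 'database' and enqueue a message."""
    job_id = str(uuid.uuid4())
    with db_lock:
        results_db[job_id] = {"status": "pending", "data": data, "result": None}
    job_queue.put(job_id)
    return job_id


def poll_result(job_id, timeout=5.0, interval=0.05):
    """Step 2: poll the 'database' until the result is ready or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with db_lock:
            row = results_db[job_id]
            if row["status"] == "done":
                return row["result"]
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} not finished after {timeout}s")


def worker():
    """Step 3: take a job off the queue, compute, write the result back."""
    while True:
        job_id = job_queue.get()  # blocks while idle instead of busy-polling
        if job_id is None:  # sentinel value: shut this worker down
            job_queue.task_done()
            break
        with db_lock:
            data = results_db[job_id]["data"]
        # Placeholder for the real statistical analysis.
        result = {"mean": statistics.mean(data), "stdev": statistics.stdev(data)}
        with db_lock:
            results_db[job_id].update(status="done", result=result)
        job_queue.task_done()


def run_demo(num_workers=3):
    """Start workers, submit two jobs, poll for both results, then shut down."""
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    job_ids = [submit_job([1.0, 2.0, 3.0, 4.0]), submit_job([10.0, 20.0, 30.0])]
    results = [poll_result(job_id) for job_id in job_ids]
    for _ in threads:
        job_queue.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Calling `run_demo()` returns one result dict per submitted job, in submission order. Note that no master is needed to pick the "best available" worker: whichever worker is idle takes the next message, so the queue itself does the load balancing.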