Search code examples
mysqlgoogle-cloud-storagegoogle-compute-enginegoogle-cloud-sql

Cloud based LAMP cluster


I run a pretty customized cluster for processing large amounts of scientific data based on a basic LAMP design. In general, I run a separate MySQL server with around 128GB of ram and about 1TB of storage. Separately, I run a head node that serves as an nfs mount point for the data input of my process, and a webserver to display results. Finally, I trypically have a few compute nodes that get their jobs from a mysql table, get the data from NFS, do some heavy lifting, then put results into mysql.

I have come across a dataset I would like to process which is pretty large (1TB of input data), and I don't really have the hardware on hand to handle it. As a result, I began investigating google compute engine etc, and the prospect of scaling instances to process these data rapidly with the results stored in a mysql instance. Upon completion the mysql tables could be dumped from the cloud and brought up locally for analysis. I would have no problem deploying a MySQL server, along with the rest of the LAMP pieces and the compute nodes, but I can't quite figure out how I would do this in the cloud.

A major sticking point seems to be the lack of read/write NFS which would allow me to get the data onto several instances, crunch it, then push the results to MySQL. This is a necessary step for me as I could queue hundreds of jobs from the webserver, then have the instances (as many as 50-100) pick the jobs up by connecting to a centralized mysql instance to find out what jobs an instance needs to do and where the data is. Process the data (there is a file conversion that happens which make the write part necessary), crunch the data, then load results to mysql. I hope I'm explaining my situation clearly. This seems like a great example of a CPU intensive process that would scale nicely in the cloud, I just can't seem to put all the pieces together... Any input is appreciated!


Solution

  • It sounds quite possible; I've been doing similar things in GCE for a while now.

    NFS mount - you just need configure it as you would normally. Set up the NFS server on the head node, and then configure the clients on the slave nodes to mount it. Here and here are some basic configuration instructions for Centos 6 I used to get NFS up and running.

    Setting up a LAMP stack is very straightforward. These machines run pretty much vanilla Linux distros, so you can just use yum or apt-get to install components.

    For the cluster, you will probably end up having an image for the head node you use once, and then another image for the slave nodes that you replicate for each one.

    For the scheduler, I've used Condor and Sge successfully, but I'm sure the other ones would work just as well.

    Hope this helps.