database-design elasticsearch docker vagrant mesos

What is the recommended setup for an Elasticsearch cluster that contains data at the scale of TBs and above?

Currently, I have several Elasticsearch nodes running on several bare metal machines containing indices at the size of TBs. We're in the process of restructuring our infrastructure and I'm not sure if this is the best way.

I have been looking at Docker, Mesos, and Vagrant as alternatives, but I'm not sure if they are even possible. There are four situations I think are relevant (along with the issue I had):

Mesos-Elasticsearch: This package runs Elasticsearch on Mesos. This seems great, but it seems it only allows scaling of data nodes at small disk size. Also, there are no master/client nodes. The package is rather alpha on Github at the moment - I received a 'No route to Host' and MasterNotDiscoveredException error on their default setup. Does anybody have experience with this?
Docker: I'm not too familiar with containers, but Dockerhub has several containers for Elasticsearch. Also, Mesos allows containers to be run on top of it. I'm concerned about the low disk space in each container since my data is in the scale of TBs. Also, the data is persistent. Is resizing the disk of the container feasible or is there a different setup for Docker containers?
Vagrant VMs: I would imagine having a VM for each ES node being suitable to allocate resources. Is there any substantial benefits to this when compared to running on bare metal? This doesn't seem to be compatible with Mesos.
Bare-metal: This is the current setup.

I would like to know which of the four is your preferred setup for an Elasticsearch cluster at the TB level. Pros and cons of each option?

Solution

I'm the author of the Apache Mesos Elasticsearch Framework. I would recommend that you try all of these approaches and pick whichever you have the best experience with. And when it comes to performance, make sure you have performance requirements in mind and then perform tests. There are other things to consider, too. Which I'll touch upon in the questions.

The Elasticsearch Framework for mesos is the most resilient of these four options. Elasticsearch nodes are run as Mesos tasks. If any of the tasks fail (hardware or software failure) they are restarted somewhere else within the Mesos cluster. If you want to add nodes (to increase performance) or remove nodes (to reduce resource use) this is as simple as sending a one-line curl request. The data is very safe. The default configuration (can be overridden) replicates all data to all nodes. So the cluster can suffer a catastrophic event and be left with a single node, and not lose any data. You can also use any Docker volume plugin to write the data to an external volume instead, so that if the tasks die, the data are still contained on the cloud volumes. There are a few other features too, check out the website. Also checkout the videos on the Container Solutions youtube channel. We're also developing tools to help make development easier, see minimesos.
This is perfectly reasonable, but you have to think how you would orchestrate your cluster. And what would happen if one or more of the containers died? Can you suffer that loss? If so, this might be the best option for DevOps (i.e. you can replicate and test against a cluster that looks like the real thing).
This is the only option I would be against. It would be fine for development, but you would see a significant performance issue in production. You could, potentially, have a full stack VM (vagrant) inside another VM (cloud). The overhead is unnecessary. Link 1, Link 2.
This is the official Elastic-recommended method and will likely provide the highest performance for a given hardware configuration. But because these are static deployments a) much of the machines' resources would be wasted (unused RAM/CPU/etc.), b) there is a significant (especially in larger organisations!) delay in provisioning new instances (compare to a single api call) and c) if an instance fails it won't be replaced and won't be fixed until someone fixes it (compare to automated failover). If your requirements for Elasticsearch are fixed, you don't need DevOps-like flexibility and you don't mind a bit of downtime, then this is probably the simplest method (but make sure you get your ES configuration right!).

So if it was me, then I would consider the dockerized setup for testing, small POC's and maybe very small production tasks. Anything more than that then I would go for the Mesos Elasticsearch option every time.