Tags: elasticsearch, coreos, kubernetes

Does the kubernetes scheduler support anti-affinity?


I'm looking at deploying Kubernetes on top of a CoreOS cluster, but I think I've run into a deal breaker of sorts.

If I'm using just CoreOS and fleet, I can specify within the unit files that I want certain services to not run on the same physical machine as other services (anti-affinity). This is sort of essential for high availability. But it doesn't look like kubernetes has this functionality yet.
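For concreteness, this is the fleet mechanism I mean: a template unit whose `[X-Fleet]` section tells fleet never to co-locate two instances. (The unit name and docker invocation here are just placeholders.)

```ini
# elasticsearch@.service -- illustrative template unit
[Unit]
Description=Elasticsearch node %i

[Service]
ExecStart=/usr/bin/docker run --rm --name es-%i elasticsearch

[X-Fleet]
# fleet refuses to schedule this unit onto a machine already
# running any unit whose name matches the glob below.
Conflicts=elasticsearch@*.service
```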

In my specific use-case, I'm going to need to run a few clusters of elasticsearch machines that need to always be available. If, for any reason, kubernetes decides to schedule all of my elasticsearch node containers for a given ES cluster on a single machine (or even a majority of them on a single machine), and that machine dies, then my elasticsearch cluster will die with it. That can't be allowed to happen.

It seems like there could be work-arounds. I could set up the resource requirements and machine specs such that only one elasticsearch instance could fit on each machine. Or I could probably use labels in some way to specify that certain elasticsearch containers should go on certain machines. I could also just provision way more machines than necessary, and way more ES nodes than necessary, and assume kubernetes will spread them out enough to be reasonably certain of high availability.
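To illustrate the labels idea, here's a rough sketch of pinning one elasticsearch pod to one specifically-labeled machine. This assumes the in-flux pre-v1 API (the v1beta3 shape) and a `nodeSelector` field; the `es-slot` label is made up for this example.

```yaml
apiVersion: v1beta3
kind: Pod
metadata:
  name: es-node-1
spec:
  # Only schedulable on a node carrying the label es-slot=1,
  # so each pod can be pinned to its own machine by hand.
  nodeSelector:
    es-slot: "1"
  containers:
  - name: elasticsearch
    image: elasticsearch:1.4
```

It works, but it trades the scheduler's flexibility for manual bookkeeping of node labels.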

But all of that seems awkward. It's much more elegant from a resource-management standpoint to just specify required hardware and anti-affinity, and let the scheduler optimize from there.

So does Kubernetes support anti-affinity in some way I couldn't find? Or does anyone know whether it will any time soon?

Or should I be thinking about this another way? Do I have to write my own scheduler?


Solution

It looks like there are a few ways that kubernetes decides how to spread containers, and these are in active development.

Firstly, of course, a machine has to have the necessary free resources before the scheduler will consider bringing up a pod there.
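As a sketch of what that means in practice (again assuming the v1beta3 shape for resource limits): if each elasticsearch container claims most of a machine's memory, two of them simply can't fit on the same node, which is effectively the crude workaround from my question.

```yaml
apiVersion: v1beta3
kind: Pod
metadata:
  name: es-node
spec:
  containers:
  - name: elasticsearch
    image: elasticsearch:1.4
    resources:
      limits:
        cpu: "2"
        memory: 12Gi  # sized so only one fits on a 16Gi machine
```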

After that, kubernetes spreads pods by replication controller, attempting to keep the different instances created by a given replication controller on different nodes.
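So for my elasticsearch case, something like the following (v1beta3 shape, hedged as above) should get the replicas spread across nodes automatically, though as best-effort rather than a hard guarantee:

```yaml
apiVersion: v1beta3
kind: ReplicationController
metadata:
  name: es-data
spec:
  replicas: 3   # the scheduler tries to place each replica created
  selector:     # by this controller on a different node
    app: es-data
  template:
    metadata:
      labels:
        app: es-data
    spec:
      containers:
      - name: elasticsearch
        image: elasticsearch:1.4
```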

It also seems that a method of scheduling that considers services and various other parameters was recently implemented: https://github.com/GoogleCloudPlatform/kubernetes/pull/2906. I'm not completely clear on exactly how to use it, though; perhaps in coordination with this scheduler config: https://github.com/GoogleCloudPlatform/kubernetes/pull/4674.
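If I'm reading those PRs right, the scheduler policy file would be JSON listing predicates and weighted priorities, roughly like the sketch below. The exact names and format are my guess from skimming the PRs, not something I've verified.

```json
{
  "predicates": [
    {"name": "PodFitsResources"},
    {"name": "PodFitsPorts"}
  ],
  "priorities": [
    {"name": "ServiceSpreadingPriority", "weight": 1}
  ]
}
```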

Probably the most interesting issue to me is that none of these scheduling priorities are considered during scale-down, only scale-up: https://github.com/GoogleCloudPlatform/kubernetes/issues/4301. That seems like a big deal: over time you could end up with weird distributions of pods, because they stay wherever they were originally placed.


Overall, I think the answer to my question at the moment is that this is an area of kubernetes that is in flux (as is to be expected pre-v1). However, it looks like much of what I need will happen automatically, given sufficient nodes and proper use of replication controllers and services.