Tags: kubernetes, google-kubernetes-engine, kubectl, persistent-volumes, volumes

Kubernetes Stateful Sets - Mapping existing IDs to persistent/stateful pods


Thanks in advance to all those who help.

Hello, I have a somewhat unique problem. It's rather lengthy to explain, but I think that solving it would expand the use cases of Kubernetes. I think I know how to solve it, but I'm not sure whether Kubernetes StatefulSets support the solution. Let me lay out the domain of the problem, the problem itself, and some of my candidate solutions, and maybe someone can help fill the gaps.

The Domain Space:

  • I have a set of Accounts (external to kubernetes) {Account_A, Account_B, Account_C, etc.}
  • Accounts can be active or inactive at any time (Important: in NO PARTICULAR ORDER).
  • If activated, a pod is deployed which serves that account and keeps a persistent volume with all of that account's workspace/data. The account is reached through its pod's unique identifier and IP.
  • If deactivated, the pod is removed but the data persists, so that the next time the account is activated it is bound to the same persistent-volume-claim and therefore has access to its previous data.
  • If reactivated, a pod is redeployed that uses the previous persistent-volume-claim to resume working on the data from previous sessions.

Obviously, looking at the available Kubernetes tools/objects, a stateful-set with headless-service is the ideal way of approaching this. It supports unique pods, which are assigned unique IPs, and supports persistent volumes. It also supports dynamically provisioning persistent-volumes through

The Problem:

As mentioned in the domain, accounts can be active in any order, but stateful-set pods are ordinal, meaning pod_1 has to be active for pod_2 to be active for pod_3 to be active, etc. We can't have pod_1 active and pod_3 active while pod_2 is inactive. This means if I enable Account_A, then Account_C, a pod named pod_1 will be created, and then a pod named pod_2 will be created.

Now you might say that this isn't a problem. We just keep a map from each account to its corresponding pod number. For example, Account_A -> pod_1 and Account_C -> pod_2.

Why is this a problem? Because when a volumeClaimTemplate is specified in the stateful-set, the persistent-volume-claims it creates are named after the pod. That means only a pod with the same name can access the same data. The data (volumes) is bound based on a pod's name rather than on the account, which creates a disconnect between accounts and their persistent volumes. Any pod named pod_2 will always have the data that pod_2 has always had, regardless of which account was "mapped" to pod_2.

Let me further illustrate this with an example:

1. Account_A=disabled, Account_B=disabled, Account_C=disabled (Start state, all accs disabled)
2. Account_A=enabled, Account_B=enabled, Account_C=enabled -> (All accounts are enabled)
    pod_1 is created (with volume_1) and mapped to Account_A
    pod_2 is created (with volume_2) and mapped to Account_B
    pod_3 is created (with volume_3) and mapped to Account_C
3. Account_A=disabled, Account_B=disabled, Account_C=disabled (All Accounts are disabled)
    pod_1 is deleted, volume_1 persists
    pod_2 is deleted, volume_2 persists
    pod_3 is deleted, volume_3 persists
4. Account_A=enabled, Account_B=disabled, Account_C=enabled (re-enable A and C but leave B disabled)
    pod_1 is created (with volume_1) and mapped to Account_A (THIS IS FINE)
    pod_2 is created (with **volume_2**) and mapped to Account_C (THIS IS **NOT** FINE)

Can you see the issue? Account_C is now using the data-store that should belong to Account_B (volume_2 was created and used by Account_B, not Account_C). This happens because volumes/claims are mapped by name to pod names, and pods have to be ordinal, i.e. pod_1 then pod_2.
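To make the mismatch concrete, here is a minimal sketch of the naming rule in plain Python. It relies on the fact that a StatefulSet names each claim `<volumeClaimTemplate name>-<StatefulSet name>-<ordinal>`. The StatefulSet name `accounts` and the template name `data` are made up for illustration, and real ordinals start at 0 rather than 1 as in the walkthrough above.

```python
def pvc_name(template: str, statefulset: str, ordinal: int) -> str:
    """Claim created for one StatefulSet pod: <template>-<statefulset>-<ordinal>."""
    return f"{template}-{statefulset}-{ordinal}"


def claims_for(statefulset: str, template: str, enabled_accounts: list[str]) -> dict[str, str]:
    """Map each enabled account to the claim its pod would mount.

    A StatefulSet with N replicas runs ordinals 0..N-1, so the i-th enabled
    account lands on ordinal i, whichever account it happens to be.
    """
    return {
        account: pvc_name(template, statefulset, ordinal)
        for ordinal, account in enumerate(enabled_accounts)
    }


# Hypothetical names: StatefulSet "accounts", volumeClaimTemplate "data".
first = claims_for("accounts", "data", ["Account_A", "Account_B", "Account_C"])
second = claims_for("accounts", "data", ["Account_A", "Account_C"])

print(first["Account_B"])   # data-accounts-1
print(second["Account_C"])  # data-accounts-1, i.e. Account_B's old claim
```

The claim name never contains anything about the account, which is why an external account-to-ordinal map cannot fix the problem on its own.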

Potential Solutions

  1. Be able to support custom non-ordinal names for pods in a stateful-set. (Simplest and most effective)

    This solves everything, and keeps the benefits and tools of statefulsets. I could name my pods what I want when they are launched, so that when an account is enabled I just launch a pod with that account's name, and the volume that is created is mapped to any pod with that same name. I've looked and can't seem to find a way to do this.

    (p.s.) I know that stateful-sets are supposed to be ordinal for ordering guarantees, but you can turn this off with "podManagementPolicy: Parallel"

  2. Some way to do this with labels and selectors instead?

    I'm rather new to Kubernetes, and I still don't fully understand all the moving parts. Maybe there's some way to use labels in my volumeClaimTemplate, to have volume claims attach to volumes with a certain label, e.g. Account_C mapped to pod_2 can request volume_3 because volume_3 has the label account=Account_C. I'm currently looking into this. If it helps, my persistent volumes are provisioned dynamically using this tool: https://github.com/kubernetes-incubator/external-storage/tree/master/nfs-client. Maybe I can somehow modify it so that it adds certain labels to the persistent-volumes it creates.

  3. Ditch statefulsets and deployments and just add pods manually to the cluster

    This is not a great solution since according to docs, pods shouldn't really exist without a statefulset or deployment as a parent, and it also removes all the built-in functionality of persistent-volumes and dynamic volume provisioning, etc. For me the dealbreaker is not having volumeClaimTemplates which create or bind to an existing volumeClaim when deployed. If I could recreate this somehow, this solution would work.

  4. Create custom Kubernetes object to do this for me

    This is not ideal, since it would be a lot of work and I wouldn't even know where to begin. I would also be recreating the exact same thing as a stateful-set, just without the ordinal mapping. I would have to figure out how to write operators and replicasets, etc. It seems like overkill for a rather simple problem.

  5. Have the persistent storage be mounted from within the pod's container

    This is a last resort since it completely removes the need for Kubernetes. It also means I have to send the connection information to the container within the pod, and it opens up an entire can of worms with security and authentication there.

I will update with anything else I find or think of. Thanks to all who help.


Solution

  • It seems to me that you're convinced that a StatefulSet is a step in the right direction, but that's not entirely true.

    StatefulSets have ordinality due to two reasons:

    • Creating ordered PersistentVolumeClaims
    • Being able to create FQDN endpoints for individual pods (using a headless service)

    In your case, neither seems to be a requirement. You just need stable storage per account. While you consider #4 from your potential solutions the least ideal, it is the most "Kubernetes native" way to do it.

    Solution

    You need to write a component that manages a StatefulSet or even a Deployment per account. I say deployment because you don't need stable network identifiers for each pod. A ClusterIP service per account will be adequate for communication.
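As a rough illustration of "one Deployment plus one ClusterIP service per account", the sketch below builds the three manifests as plain Python dicts, with the claim named after the account instead of a pod ordinal. Everything specific here is an assumption for the example: the helper names, the image `example/workspace:latest`, the port, the `/data` mount path, the `1Gi` size and the `ReadWriteOnce` access mode (an NFS-backed setup may want `ReadWriteMany`). Kubernetes object names must be lowercase DNS-style labels, so an ID like Account_C is normalized first.

```python
import re


def k8s_name(account_id: str) -> str:
    """Lowercase the ID and replace anything outside [a-z0-9-] with a hyphen."""
    name = re.sub(r"[^a-z0-9-]+", "-", account_id.lower()).strip("-")
    if not name:
        raise ValueError(f"cannot derive a resource name from {account_id!r}")
    return name


def account_manifests(account_id, image="example/workspace:latest",
                      storage="1Gi", port=8080):
    """Build PVC, Deployment and Service dicts for one account (illustrative values)."""
    name = k8s_name(account_id)
    claim = f"{name}-data"          # claim is keyed by account, not by ordinal
    labels = {"account": name}
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": claim, "labels": labels},
        "spec": {
            "accessModes": ["ReadWriteOnce"],   # assumption, depends on the storage
            "resources": {"requests": {"storage": storage}},
        },
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": 1,
            # Recreate avoids two pods holding the same volume during a rollout.
            "strategy": {"type": "Recreate"},
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "workspace",
                        "image": image,
                        "ports": [{"containerPort": port}],
                        "volumeMounts": [{"name": "data", "mountPath": "/data"}],
                    }],
                    "volumes": [{
                        "name": "data",
                        "persistentVolumeClaim": {"claimName": claim},
                    }],
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "type": "ClusterIP",
            "selector": labels,
            "ports": [{"port": port, "targetPort": port}],
        },
    }
    return {"pvc": pvc, "deployment": deployment, "service": service}


manifests = account_manifests("Account_C")
print(manifests["pvc"]["metadata"]["name"])  # account-c-data
```

Because the claim name is derived from the account, deleting and later recreating the Deployment for Account_C always points back at the same claim, no matter which other accounts are active.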

    In the Kubernetes world, these components are called controllers (when they work without custom objects) and operators (when they define custom objects and manage applications).

    You can start by looking into operator-sdk and controller-runtime. Operator SDK is a framework that bundles commonly used functionality on top of controller-runtime. It also makes developers' lives easier by incorporating kubebuilder, which is used to generate the CRD and Kubernetes API code for custom objects. All you need to define is the structs for your custom object and a controller.

    Take a look at Operator SDK, you'll find that creating and managing custom objects is not that hard.

    Custom object based flow for your problem

    This is how I imagine the flow of your operator from what I understood in your write up.

    • One Account object maps to one account. Each object has unique metadata that maps it to its account. It should also have an active: boolean in its spec.
    • Watch for custom Account objects
    • Whenever you need to create a new account, use Kubernetes APIs to create a new Account object (will trigger an Add event in the controller) and then your controller should

      • Create/Update a PersistentVolumeClaim for the account
      • Create/Update the Deployment with the volume from created PVC specified in the Pod template
      • Catch: Add events are also received for old objects when controller restarts. So the action taken should be "Create or Update".
    • Set the active field in your custom object to false for deactivating the account (a Modify event in the controller) and then your controller should

      • Delete the deployment without touching the volume at all.
    • Set the active field to true for reactivating the account. (modify event again)
      • Recreate the deployment with the same volume specified in the Pod template
    • Delete the Account object to clean up underlying resources.
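The flow above can be sketched as a single idempotent reconcile step. This is not operator-sdk or controller-runtime code (those are Go frameworks); it is a plain-Python model with an in-memory stand-in for the cluster, so `Account`, `FakeCluster`, `reconcile` and `finalize` are all hypothetical names. It replays the enable/disable sequence from the question to show that each account keeps its own claim.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    """Stand-in for the custom Account object: a name plus spec.active."""
    name: str
    active: bool = True


@dataclass
class FakeCluster:
    """In-memory stand-in for the API server; a real controller would call the Kubernetes API."""
    pvcs: set = field(default_factory=set)
    deployments: dict = field(default_factory=dict)  # deployment name -> claim it mounts


def reconcile(cluster: FakeCluster, account: Account) -> None:
    """Create-or-update: safe to run again for old objects after a controller restart."""
    claim = f"{account.name}-data"
    cluster.pvcs.add(claim)                          # ensure the claim exists
    if account.active:
        cluster.deployments[account.name] = claim    # (re)create the Deployment
    else:
        cluster.deployments.pop(account.name, None)  # remove it, leave the volume alone


def finalize(cluster: FakeCluster, account: Account) -> None:
    """Account object deleted: clean up the Deployment and the claim."""
    cluster.deployments.pop(account.name, None)
    cluster.pvcs.discard(f"{account.name}-data")


cluster = FakeCluster()
accounts = {n: Account(n) for n in ("account-a", "account-b", "account-c")}

for acc in accounts.values():            # all accounts enabled
    reconcile(cluster, acc)
for acc in accounts.values():            # all accounts disabled
    acc.active = False
    reconcile(cluster, acc)
for n in ("account-a", "account-c"):     # re-enable A and C, leave B disabled
    accounts[n].active = True
    reconcile(cluster, accounts[n])

print(cluster.deployments)
# {'account-a': 'account-a-data', 'account-c': 'account-c-data'}
print(sorted(cluster.pvcs))
# ['account-a-data', 'account-b-data', 'account-c-data']
```

In a real operator the body of `reconcile` would issue create-or-update calls against the API server, but the shape stays the same: the desired state is derived from the Account object alone, so the order in which accounts are toggled does not matter.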

    While all of this might not make perfect sense right away, I would still suggest that you go through operator-sdk's docs and examples. IMO, that would be a leap in the right direction.

    Cheers!