
Making a DaemonSet node initialisation pod run only once per node


I want to run an initialisation script on each node, and I want it to run only once per node.

Here is the YAML I use to do some basic initialisation on each node. The problem: once the initialisation script finishes, the pod exits with exit code 0 and the DaemonSet restarts it, running the initialisation script again and again.

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: test-init-node-cr
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: test-init-node-sa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: test-init-node-cr
subjects:
- kind: ServiceAccount
  name: test-init-node-sa
  namespace: default
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: test-init-node-sa
  namespace: default
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: test-init-node
  namespace: default
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: test-init-node
      app.kubernetes.io/component: configurator
  # replicas: 3
  template:
    metadata:
      name: test-init-node
      labels:
        app.kubernetes.io/name: test-init-node
        app.kubernetes.io/component: configurator
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: k8s.amazee.io/node-configured
                operator: DoesNotExist
      hostPID: true
      hostNetwork: true
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      serviceAccountName: test-init-node-sa
      containers:
      - name: init
        env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        command: 
        - nsenter
        - --mount=/proc/1/ns/mnt
        - --
        - bash
        - -xc
        - |
          echo "starting the magic"
          echo "*   hard  core    unlimited" >>  /etc/security/limits.d/game.conf 
          echo "*   soft  core    unlimited" >>  /etc/security/limits.d/game.conf 

        image: alpine/k8s:1.28.0
        resources:
          requests:
            cpu: 50m
            memory: 50M
        securityContext:
          runAsUser: 0
          privileged: true

Is there any way for me to prevent the DaemonSet from restarting the pod after it exits? i.e. ensuring that the initialisation only happens once per node.

I tried adding a preStop hook, but it does not seem to have any effect. The idea is that once k8s.amazee.io/node-configured is set, the DaemonSet will no longer schedule onto that node.

          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - kubectl label node "$MY_NODE_NAME" k8s.amazee.io/node-configured=$(date +%s)
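
(In hindsight this is expected: a preStop hook only runs when Kubernetes terminates the container, e.g. on an API-initiated deletion or eviction; it is not called when the container exits on its own after completing, which is exactly what happens here. For completeness, the hook would also need to sit under lifecycle: in the container spec, roughly like this:)

        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - kubectl label node "$MY_NODE_NAME" k8s.amazee.io/node-configured=$(date +%s)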

Neither does adding a semicolon (this is expected, but I thought why not give it a try):

        command: 
        - nsenter
        - --mount=/proc/1/ns/mnt
        - --
        - bash
        - -xc
        - |
          echo "starting the magic"
          echo "*   hard  core    unlimited" >>  /etc/security/limits.d/game.conf 
          echo "*   soft  core    unlimited" >>  /etc/security/limits.d/game.conf 
        - ; 
        - /bin/sh
        - -c
        - kubectl label node "$MY_NODE_NAME" k8s.amazee.io/node-configured=$(date +%s)
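
(This cannot work either: every item of an exec-form command list is passed to the runtime as a literal argument, so the ; and everything after it simply become extra arguments to the first bash -c script. Commands can only be chained inside a single script string, e.g. by wrapping the nsenter call in an outer shell. A sketch, assuming the image ships both nsenter and kubectl, as alpine/k8s does here:)

        command:
        - bash
        - -xc
        - |
          # run the init steps inside the host mount namespace
          nsenter --mount=/proc/1/ns/mnt -- bash -xc '
            echo "starting the magic"
            echo "*   hard  core    unlimited" >> /etc/security/limits.d/game.conf
            echo "*   soft  core    unlimited" >> /etc/security/limits.d/game.conf
          '
          # then label the node from the container, where kubectl is available
          kubectl label node "$MY_NODE_NAME" k8s.amazee.io/node-configured=$(date +%s)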



Solution

  • Keeping the DaemonSet pod running works, but it still takes up some resources and does not feel elegant.

    From @norbjd's answer, I saw this snippet in the GCP tutorial:

    initContainers:
      - image: ubuntu:18.04
        name: node-initializer
        command: ["/scripts/entrypoint.sh"]
        env:
          - name: ROOT_MOUNT_DIR
            value: /root
        securityContext:
          privileged: true
    containers:
      - image: "gcr.io/google-containers/pause:2.0"
        name: pause
    

    The tutorial talks about using the pause container from google-containers to avoid a restart of the pod. However, what caught my eye was the initContainers.

    Source:

    Init containers are exactly like regular containers, except:

    • Init containers always run to completion.
    • Each init container must complete successfully before the next one starts.
    

    This gave me an idea: what if I ran two containers, the first an initContainer that does all the initialisation, and the second a regular container that adds a label to prevent scheduling, hence stopping any further pod creation/restarts of the DaemonSet on that particular node?

    Of course, by the same logic, both could be initContainers. In my case I used one initContainer and one regular container; since regular containers only start after all initContainers have completed, the result is the same.

    Working example:

    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: test-init-node-cr
    rules:
    - apiGroups:
      - ""
      resources:
      - nodes
      verbs:
      - get
      - patch
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: test-init-node-sa
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: test-init-node-cr
    subjects:
    - kind: ServiceAccount
      name: test-init-node-sa
      namespace: default
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: test-init-node-sa
      namespace: default
    ---
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: test-init-node
      namespace: default
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: test-init-node
          app.kubernetes.io/component: configurator
      # replicas: 3
      template:
        metadata:
          name: test-init-node
          labels:
            app.kubernetes.io/name: test-init-node
            app.kubernetes.io/component: configurator
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: test-init-node-date
                    operator: DoesNotExist
          hostPID: true
          hostNetwork: true
          tolerations:
          - effect: NoSchedule
            key: node-role.kubernetes.io/master
          serviceAccountName: test-init-node-sa
          initContainers:
          - name: init
            command:         
            - nsenter
            - --mount=/proc/1/ns/mnt
            - --
            - bash
            - -xc
            - |
              echo "starting the magic"
              echo "*   hard  core    unlimited" >>  /etc/security/limits.d/game.conf 
              echo "*   soft  core    unlimited" >>  /etc/security/limits.d/game.conf 
              echo "user00   soft  core    unlimited" >>  /etc/security/limits.d/game.conf 
            image: alpine/k8s:1.28.0
            resources:
              requests:
                cpu: 50m
                memory: 50M
            securityContext:
              runAsUser: 0
              privileged: true
    
          containers: 
          - name: add-label-to-remove-scheduling
            env:
            - name: MY_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            command:
            - sh
            - -c
            - |
              kubectl label node "$MY_NODE_NAME" test-init-node-date=$(date +%s) 
            image: alpine/k8s:1.28.0
            resources:
              requests:
                cpu: 50m
                memory: 50M
            securityContext:
              runAsUser: 0
              privileged: true
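
    To try it out (assuming the manifest above is saved as test-init-node.yaml):

    kubectl apply -f test-init-node.yaml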
    

    Rough Explanation:

    1. create the relevant service account and permissions
    2. node affinity checks whether the test-init-node-date label is set; if it is, the node is skipped and nothing runs
    3. the initContainer runs the init script as needed
    4. the regular container adds the test-init-node-date label (this can be watched live, see below)
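
    The rollout can be watched using the pod labels from the manifest; once a node gets labelled, the DaemonSet controller removes its pod because the node no longer matches the affinity:

    kubectl get pods -l app.kubernetes.io/name=test-init-node -w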

    Sample labels on a configured node:

    kubernetes.io/os=linux
    node.kubernetes.io/instance-type=n2d-standard-8
    test-init-node-date=1693280374 
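
    The same label can be listed across all nodes (-L adds the label value as a column):

    kubectl get nodes -L test-init-node-date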
    

    The DaemonSet pod thus runs the init script once (in the init container) and then adds the test-init-node-date label (in the regular container). Since the test-init-node-date label is now set, no new pods will be scheduled onto that node by the DaemonSet.

    And finally, to quote norbjd: to prevent an accidental re-run of the init script (e.g. if someone deletes the label), you can add a safeguard check before running the script:

    if [ ! -f /etc/game-conf-limits-updated ]
    then
        echo "starting the magic"
        echo "*   hard  core    unlimited" >> /etc/security/limits.d/game.conf
        echo "*   soft  core    unlimited" >> /etc/security/limits.d/game.conf
    
        touch /etc/game-conf-limits-updated
    fi
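
    Conversely, to deliberately re-initialise a node, both markers have to be cleared: the label on the node object and the marker file on the node itself (a trailing - on kubectl label removes the label):

    kubectl label node <node-name> test-init-node-date-
    # then, on the node: rm /etc/game-conf-limits-updated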