Tags: dask, dask-distributed, persistent-volumes, persistent-volume-claims, dask-kubernetes

Add a Persistent Volume Claim to a Kubernetes Dask Cluster


I am running a Dask cluster and a Jupyter notebook server on cloud resources using Kubernetes and Helm.

I am using a YAML file for the Dask cluster and Jupyter, initially taken from https://docs.dask.org/en/latest/setup/kubernetes-helm.html:

apiVersion: v1
kind: Pod
worker:
  replicas: 2 #number of workers
  resources:
    limits:
      cpu: 2
      memory: 2G
    requests:
      cpu: 2
      memory: 2G
  env:
    - name: EXTRA_PIP_PACKAGES
      value: s3fs --upgrade
# We want to keep the same packages on the workers and jupyter environments
jupyter:
  enabled: true
  env:
    - name: EXTRA_PIP_PACKAGES
      value: s3fs --upgrade
  resources:
    limits:
      cpu: 1
      memory: 2G
    requests:
      cpu: 1
      memory: 2G

and I am using another YAML file to create the storage locally.

# CREATE A PERSISTENT VOLUME CLAIM // attached to our pod config
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dask-cluster-persistent-volume-claim
spec:
  accessModes:
    - ReadWriteOnce # used by a single node; ReadOnlyMany: read-only on many nodes; ReadWriteMany: read/write on many nodes
  resources:
    requests:
      storage: 2Gi # storage capacity

I would like to add this persistent volume claim to the first YAML file, but I couldn't figure out where to add the volumes and volumeMounts. If you have an idea, please share it. Thank you.


Solution

  • I started by creating a PersistentVolumeClaim with the following YAML file:

    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: pdask-cluster-persistent-volume-claim
    spec:
      accessModes:
        - ReadWriteOnce # used by a single node; ReadOnlyMany: read-only on many nodes; ReadWriteMany: read/write on many nodes (see https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes)
      resources:
        requests:
          storage: 2Gi
    

    applying it in bash:

    kubectl apply -f Dask-Persistent-Volume-Claim.yaml
    #persistentvolumeclaim/pdask-cluster-persistent-volume-claim created
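
    Since the claim above does not set a storageClassName, it relies on the cluster's default StorageClass for dynamic provisioning. A quick way to check which class will be used (a sketch of the verification step):

    kubectl get storageclass
    # the default class is marked "(default)" next to its name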
    

    I checked the creation of the persistent volume:

    kubectl get pv
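
    To confirm that the claim itself is bound to a volume, the PVC can also be queried directly (a sketch; the name matches the claim created above):

    kubectl get pvc pdask-cluster-persistent-volume-claim
    # STATUS should report "Bound"
    kubectl describe pvc pdask-cluster-persistent-volume-claim
    # shows the bound volume, storage class and capacity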
    

    I then made major changes to the Dask cluster YAML: I added the volumes and volumeMounts so that the workers and Jupyter read from and write to a /save_data directory backed by the persistent volume claim created previously, and I set serviceType to LoadBalancer with an explicit port:

    apiVersion: v1
    kind: Pod
    scheduler:
      name: scheduler 
      enabled: true
      image:
        repository: "daskdev/dask"
        tag: 2021.8.1
        pullPolicy: IfNotPresent
      replicas: 1  #(should always be 1).
      serviceType: "LoadBalancer" # Scheduler service type. Set to `LoadBalancer` to expose outside of your cluster.
      # serviceType: "NodePort"
      # serviceType: "ClusterIP"
      #loadBalancerIP: null  # Some cloud providers allow you to specify the loadBalancerIP when using the `LoadBalancer` service type. If your cloud does not support it this option will be ignored.
      servicePort: 8786 # Scheduler service internal port.
    # DASK WORKERS
    worker:
      name: worker  # Dask worker name.
      image:
        repository: "daskdev/dask"  # Container image repository.
        tag: 2021.8.1  # Container image tag.
        pullPolicy: IfNotPresent  # Container image pull policy.
        dask_worker: "dask-worker"  # Dask worker command. E.g `dask-cuda-worker` for GPU worker.
      replicas: 2
      resources:
        limits:
          cpu: 2
          memory: 2G
        requests:
          cpu: 2
          memory: 2G
      mounts: # Worker Pod volumes and volume mounts. mounts.volumes follows the Kubernetes API v1 Volumes spec; mounts.volumeMounts follows the Kubernetes API v1 VolumeMount spec.
        volumes:
          - name: dask-storage
            persistentVolumeClaim:
              claimName: pdask-cluster-persistent-volume-claim # must match the PVC created above
        volumeMounts:
          - name: dask-storage
            mountPath: /save_data # folder for storage
      env:
        - name: EXTRA_PIP_PACKAGES
          value: s3fs --upgrade
    # We want to keep the same packages on the worker and jupyter environments
    jupyter:
      name: jupyter  # Jupyter name.
      enabled: true  # Enable/disable the bundled Jupyter notebook.
      #rbac: true  # Create RBAC service account and role to allow Jupyter pod to scale worker pods and access logs.
      image:
        repository: "daskdev/dask-notebook"  # Container image repository.
        tag: 2021.8.1  # Container image tag.
        pullPolicy: IfNotPresent  # Container image pull policy.
      replicas: 1  # Number of notebook servers.
      serviceType: "LoadBalancer" # Scheduler service type. Set to `LoadBalancer` to expose outside of your cluster.
      # serviceType: "NodePort"
      # serviceType: "ClusterIP"
      servicePort: 80  # Jupyter service internal port.
      # This hash corresponds to the password 'dask'
      #password: 'sha1:aae8550c0a44:9507d45e087d5ee481a5ce9f4f16f37a0867318c' # Password hash.
      env:
        - name: EXTRA_PIP_PACKAGES
          value: s3fs --upgrade
      resources:
        limits:
          cpu: 1
          memory: 2G
        requests:
          cpu: 1
          memory: 2G
      mounts: # Jupyter Pod volumes and volume mounts. mounts.volumes follows the Kubernetes API v1 Volumes spec; mounts.volumeMounts follows the Kubernetes API v1 VolumeMount spec.
        volumes:
          - name: dask-storage
            persistentVolumeClaim:
              claimName: pdask-cluster-persistent-volume-claim # must match the PVC created above
        volumeMounts:
          - name: dask-storage
            mountPath: /save_data # folder for storage
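
    Before installing, the chart can be rendered with a dry run to check that the mounts end up on the worker and Jupyter pods (a sketch, assuming the Dask Helm repository still needs to be added):

    helm repo add dask https://helm.dask.org/
    helm repo update
    helm install my-config dask/dask -f values.yaml --dry-run --debug
    # inspect the rendered manifests for the dask-storage volume and volumeMounts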
    

    Then, I installed my Dask configuration using Helm:

    helm install my-config dask/dask -f values.yaml
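
    Because both the scheduler and Jupyter use serviceType: LoadBalancer, their external IPs can be looked up once the release is running (a sketch; the exact service names depend on the release name):

    kubectl get pods
    # wait until the scheduler, worker and jupyter pods are Running
    kubectl get services
    # note the EXTERNAL-IP of the scheduler (port 8786) and Jupyter (port 80) services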
    

    Finally, I accessed my Jupyter pod interactively:

    kubectl exec -ti [pod-name] -- /bin/bash
    

    to examine the existence of the /save_data folder.
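
    A few quick checks inside the pod confirm that the claim is mounted and writable (a sketch; the test file name is just an example):

    df -h /save_data                  # the volume backing the claim (2Gi requested) should appear here
    ls -ld /save_data                 # the mount point should exist
    touch /save_data/test-write.txt   # verify the mount is writable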