kubernetes, google-cloud-platform, google-kubernetes-engine, kubernetes-pvc, google-cloud-filestore

Unable to mount GCP Filestore PVC to job pods


I have a Kubernetes Job (with parallelism: 50) running on a GKE Autopilot cluster that needs more storage than the maximum ephemeral storage Autopilot provisions per node (10Gi). Since the pods need ReadWriteMany access to the storage, I decided on GCP Filestore (though it would have been nice if the minimum Filestore instance size were less than 1 TiB) to back a PVC that can be mounted in the Job pods. However, the Job pods are stuck in ContainerCreating, and judging by the event logs, a MountVolume.MountDevice failure seems to be the reason:

 Warning  FailedScheduling  11m                   gke.io/optimize-utilization-scheduler  0/12 nodes are available: 11 Insufficient memory, 12 Insufficient cpu. preemption: 0/12 nodes are available: 12 No preemption victims found for incoming pod..
  Normal   TriggeredScaleUp  11m                   cluster-autoscaler                     pod triggered scale-up
  Normal   Scheduled         6m39s                 gke.io/optimize-utilization-scheduler  Successfully assigned default/mypod-7l5k9 to gk3-mycluster-3-e79620bd-jvsg
  Warning  FailedMount       4m8s (x6 over 4m39s)  kubelet                                MountVolume.MountDevice failed for volume "pvc-435bf565-25f0-43f7-86d4-b3ecadce43a3" : rpc error: code = Aborted desc = An operation with the given volume key modeInstance/asia-northeast1-b/pvc-435bf565-25f0-43f7-86d4-b3ecadce43a3/vol1 already exists.
 --- Most likely a long process is still running to completion. Retrying.
  Warning  FailedMount  2m19s                kubelet  Unable to attach or mount volumes: unmounted volumes=[my-mounted-storage], unattached volumes=[kube-api-access-4gs6h shared-storage]: timed out waiting for the condition
  Warning  FailedMount  96s (x2 over 4m39s)  kubelet  MountVolume.MountDevice failed for volume "pvc-435bf565-25f0-43f7-86d4-b3ecadce43a3" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount  5s (x2 over 4m36s)   kubelet  Unable to attach or mount volumes: unmounted volumes=[my-mounted-storage], unattached volumes=[my-mounted-storage kube-api-access-4gs6h]: timed out waiting for the condition

Here are my PVC and Job manifests:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: podpvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: standard-rwx
  resources:
    requests:
      storage: 1Ti
---
apiVersion: batch/v1
kind: Job
metadata:
  name: mypod
  labels:
    app.kubernetes.io/name: mypod
spec:
  parallelism: 50
  template:
    metadata:
      name: mypod
    spec:
      serviceAccountName: workload-identity-sa
      volumes:
      - name: my-mounted-storage
        persistentVolumeClaim:
          claimName: podpvc
      containers:
      - name: mypod-container
        image: mypod-image:staging-0.1
        imagePullPolicy: Always
        env:
        - name: env
          value: "stg"
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
        volumeMounts:
        - name: my-mounted-storage
          mountPath: /mnt/data
      restartPolicy: OnFailure

Both the PV and PVC seem to be healthy and bound, and there don't appear to be any existing volume attachments on the nodes (kubectl describe nodes | grep Attach). I've also tried deleting both the PVC and the Job and recreating them, but the issue persists.
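
For reference, these are roughly the checks I ran (resource names are the ones from the manifests above):

# Both objects report Bound
kubectl get pvc podpvc
kubectl get pv

# No stale attachments on the nodes
kubectl describe nodes | grep Attach
kubectl get volumeattachments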



Solution

  • The following checkpoints can help you resolve the issue:

    1. Check whether the Filestore instance is in the default network:

    Check whether the GKE cluster was created in a non-default network (you can find the cluster's network in the Networking section of the cluster details) while you are using one of the GKE-supplied storageClasses: standard-rwx, enterprise-rwx, or premium-rwx. Those built-in classes provision the Filestore instance in the default network, so the mount fails because the Filestore instance (default network) cannot be reached from the nodes (non-default network).

    To resolve this, specify the network parameter so that the Filestore instance is provisioned in the same VPC network as the GKE cluster, by adding the parameters.network field to a custom StorageClass as follows (replace default with your cluster's network name if it differs):

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: filestore-example
    provisioner: filestore.csi.storage.gke.io
    volumeBindingMode: Immediate
    allowVolumeExpansion: true
    parameters:
      tier: standard
      network: default
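
    After creating this StorageClass, the PVC must also reference it; otherwise it keeps using the built-in standard-rwx class. A minimal sketch, reusing the PVC from the question and assuming the filestore-example class above:

    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: podpvc
    spec:
      accessModes:
        - ReadWriteMany
      # Points at the custom class that pins the Filestore network
      storageClassName: filestore-example
      resources:
        requests:
          storage: 1Ti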
    

    2. Check the IP addresses:

    Check whether the IP address of the Filestore instance differs from the IP address recorded in the volume. The PersistentVolume bound to the PVC should reference both the Filestore instance's IP address and its name. If they differ, edit the manifest and set the correct IP address.
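
    A rough way to compare the two, assuming the gcloud CLI is available and using the instance name and zone from the error message above (the Filestore CSI driver normally records the instance IP under spec.csi.volumeAttributes.ip on the PersistentVolume):

    # IP address of the Filestore instance according to GCP
    gcloud filestore instances describe pvc-435bf565-25f0-43f7-86d4-b3ecadce43a3 \
      --zone=asia-northeast1-b \
      --format="value(networks[0].ipAddresses[0])"

    # IP address recorded in the PV bound to the PVC
    PV_NAME=$(kubectl get pvc podpvc -o jsonpath='{.spec.volumeName}')
    kubectl get pv "$PV_NAME" -o jsonpath='{.spec.csi.volumeAttributes.ip}'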

    For more information follow this document.