kubernetes, google-cloud-filestore

Clients in different regions can't connect to Filestore


I have two GKE clusters in the same Google Cloud project. Using the same PV/PVC YAMLs, one cluster can successfully mount the Filestore instance while the other fails (a sketch of the manifests is included after the logs below). The failing GKE cluster's events look like:

Event  : Pod [podname] Unable to attach or mount volumes: unmounted volumes=[nfs-pv], unattached volumes=[nfs-pv]: timed out waiting for the condition   FailedMount
Event  : Pod [podname] MountVolume.SetUp failed for volume "nfs-pv" : mount failed: exit status 1

The kubelet logs for the failed mount:

pod_workers.go:191] Error syncing pod [guid] ("[podname](guid)"), skipping: unmounted volumes=[nfs-pv], unattached volumes=[nfs-pv]: timed out waiting for the condition
kubelet.go:1622] Unable to attach or mount volumes for pod "podname(guid)": unmounted volumes=[nfs-pv], unattached volumes=[nfs-pv]: timed out waiting for the condition; skipping pod"
mount_linux.go:150] Mount failed: exit status 1
Output: Running scope as unit: run-r1fb543aa9a9246e0be396dd93bb424f6.scope
Mount failed: mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c61546e6-9769-4e16-bd0b-c73f904272aa/volumes/kubernetes.io~nfs/nfs-pv --scope -- /home/kubernetes/containerized_mounter/mounter mount -t nfs 192.168.99.2:/mount /var/lib/kubelet/pods/c61546e6-9769-4e16-bd0b-c73f904272aa/volumes/kubernetes.io~nfs/nfs-pv
Output: mount.nfs: Connection timed out
Mounting command: chroot
Mounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs 192.168.99.2:/mount /var/lib/kubelet/pods/c61546e6-9769-4e16-bd0b-c73f904272aa/volumes/kubernetes.io~nfs/nfs-pv]
nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/nfs/c61546e6-9769-4e16-bd0b-c73f904272aa-nfs-pv podName:c61546e6-9769-4e16-bd0b-c73f904272aa nodeName:}" failed. No retries permitted until 2021-09-11 10:01:44.725959505 +0000 UTC m=+820955.435941160 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"nfs-pv\" (UniqueName: \"kubernetes.io/nfs/c61546e6-9769-4e16-bd0b-c73f904272aa-nfs-pv\") pod \"podname\" (UID: \"c61546e6-9769-4e16-bd0b-c73f904272aa\") : mount failed: exit status 1\nMounting command: systemd-run\nMounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c61546e6-9769-4e16-bd0b-c73f904272aa/volumes/kubernetes.io~nfs/nfs-pv --scope -- /home/kubernetes/containerized_mounter/mounter mount -t nfs 192.168.99.2:/mount /var/lib/kubelet/pods/c61546e6-9769-4e16-bd0b-c73f904272aa/volumes/kubernetes.io~nfs/nfs-pv\nOutput: Running scope as unit: run-r1fb543aa9a9246e0be396dd93bb424f6.scope\nMount failed: mount failed: exit status 32\nMounting command: chroot\nMounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs 192.168.99.2:/mount /var/lib/kubelet/pods/c61546e6-9769-4e16-bd0b-c73f904272aa/volumes/kubernetes.io~nfs/nfs-pv]\nOutput: mount.nfs: Connection timed out\n"
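
For reference, the PV/PVC manifests used on both clusters look roughly like the sketch below. The server IP and share path come from the mount output above; the PVC name, storage size, and access mode are illustrative, not taken from the original manifests.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  nfs:
    server: 192.168.99.2   # Filestore instance IP (from the mount logs)
    path: /mount           # Filestore share name (from the mount logs)
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc            # illustrative name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""     # bind statically to the PV above, not via a StorageClass
  volumeName: nfs-pv
  resources:
    requests:
      storage: 1Ti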

Perhaps one clue is that the two clusters are in separate regions and separate subnets? Why can one cluster connect to the Filestore instance but not the other?


Solution

  • After many, many hours of debugging I was able to find an answer, based on this thread: https://groups.google.com/g/google-cloud-filestore-discuss/c/wKTT6hEzk08

    The failing GKE cluster was in a subnet outside of the RFC 1918 ranges, and was therefore not accepted as a Filestore client. Once we changed the subnet to a valid RFC 1918 range, both GKE clusters were able to successfully mount the Filestore instance (a quick way to check a subnet's range is shown at the end of this answer).

    This was extremely frustrating, given that the RFC 1918 requirement is not made clear in the documentation or troubleshooting guides - and in fact other Google Cloud services worked fine with the non-RFC 1918 subnet.
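
    For anyone hitting the same issue, a quick check (sketched below with placeholder subnet and region names) is to print the subnet's primary CIDR and compare it against the RFC 1918 ranges:

    # Print the primary IP range of the subnet used by the failing cluster's nodes.
    # SUBNET_NAME and REGION are placeholders; substitute your own values.
    gcloud compute networks subnets describe SUBNET_NAME \
        --region=REGION \
        --format="value(ipCidrRange)"

    # At the time of writing, Filestore only accepted clients whose addresses fall
    # inside the RFC 1918 ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16.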