I have a Cloud Run service where I mount two volumes from a VM implementing an NFSv4 server. It's very simple and straightforward, no fancy config.
The VM and the Cloud Run service are on the same subnet, and there's a firewall rule allowing 2049/TCP from Cloud Run to the VM. The volumes are normally mounted correctly: ss -tuna | grep :2049
on the server shows open connections from several IPs on the subnet. I tried reading and writing files on the NFS share, and the application works fine.
Also, when I deploy a new revision, it fails to deploy if I type an invalid mount point, so I expect the configuration and the connection to be correct. I also tested from a VM in the same subnet as the NFS server and the Cloud Run instance: the NFS shares can be mounted, and files can be listed, created and deleted.
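For reference, the manual test from the helper VM looked roughly like this (the server IP 10.0.1.5 and the export path /export/dropbox are placeholders for my actual values):

```shell
# Mount one of the NFSv4 exports from a test VM on the same subnet.
# 10.0.1.5 and /export/dropbox are placeholders, not my real values.
sudo mkdir -p /mnt/nfs-test
sudo mount -t nfs4 -o vers=4.2 10.0.1.5:/export/dropbox /mnt/nfs-test

# Basic list/create/delete checks
ls /mnt/nfs-test
touch /mnt/nfs-test/probe && rm /mnt/nfs-test/probe
sudo umount /mnt/nfs-test
```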
What drives me crazy is that in the Cloud Run application log I regularly find the following error:
terminated: Application failed to start: container 1: mounting volume timed out (type: nfs, name: datastore-staging-dropbox): The NFS server may not be reachable or may not exist. Check your VPC connectivity and firewall settings.
mounting volume timed out (type: nfs, name: datastore-staging-secure): The NFS server may not be reachable or may not exist. Check your VPC connectivity and firewall settings.
timed out after 30s with 0 of 3 mounts remaining
Both the NFS volumes are listed in the error.
On the NFS Virtual Machine I see nothing wrong. Server load is barely above 0.15 and memory usage is regularly below 50%. I tried increasing the number of nfsd threads to 16, increasing the grace time to 90s, and reducing the lease time to 15s, given the highly volatile nature of Cloud Run containers (albeit this is a staging environment, so it doesn't receive that much traffic).
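For context, the tuning went into /etc/nfs.conf (key names as documented in nfs.conf(5); the values are the ones mentioned above):

```ini
# /etc/nfs.conf — server tuning; requires restarting nfs-server to apply
[nfsd]
threads = 16
grace-time = 90
lease-time = 15
```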
With tcpdump I spotted the following errors: NFS4ERR_BADSESSION and NFS4ERR_STALE_CLIENTID, but this was before the tuning.
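For reference, the capture that surfaced those errors was along these lines (the interface name ens4 is a placeholder):

```shell
# Capture NFS traffic on the server; ens4 is a placeholder interface name.
sudo tcpdump -i ens4 -s 0 -w /tmp/nfs.pcap port 2049

# Then inspect the capture for non-zero NFSv4 status codes with tshark,
# which is how the BADSESSION / STALE_CLIENTID errors showed up.
tshark -r /tmp/nfs.pcap -Y 'nfs.nfsstat4 != 0' -T fields -e nfs.nfsstat4
```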
In the NFS server logs I see frequent reconnections from two IPs, even though I have set the maximum number of instances to 1:
Dec 06 09:40:45 nfs-micro-instance rpc.mountd[36633]: v4.2 client detached: 0xdace633e6752b84d from "10.0.1.24:897"
Dec 06 09:40:46 nfs-micro-instance rpc.mountd[36633]: v4.2 client attached: 0xdace63406752b84d from "10.0.1.24:897"
Dec 06 09:40:46 nfs-micro-instance rpc.mountd[36633]: v4.2 client detached: 0xdace633f6752b84d from "10.0.1.208:1019"
Dec 06 09:40:55 nfs-micro-instance rpc.mountd[36633]: v4.2 client attached: 0xdace63416752b84d from "10.0.1.208:1019"
Dec 06 09:40:55 nfs-micro-instance rpc.mountd[36633]: v4.2 client detached: 0xdace63406752b84d from "10.0.1.24:897"
Dec 06 09:40:57 nfs-micro-instance rpc.mountd[36633]: v4.2 client attached: 0xdace63426752b84d from "10.0.1.24:897"
Dec 06 09:40:57 nfs-micro-instance rpc.mountd[36633]: v4.2 client detached: 0xdace63416752b84d from "10.0.1.208:1019"
Dec 06 09:41:06 nfs-micro-instance rpc.mountd[36633]: v4.2 client attached: 0xdace63436752b84d from "10.0.1.208:1019"
Dec 06 09:41:06 nfs-micro-instance rpc.mountd[36633]: v4.2 client detached: 0xdace63426752b84d from "10.0.1.24:897"
Dec 06 09:41:07 nfs-micro-instance rpc.mountd[36633]: v4.2 client attached: 0xdace63446752b84d from "10.0.1.24:897"
Dec 06 09:41:07 nfs-micro-instance rpc.mountd[36633]: v4.2 client detached: 0xdace63436752b84d from "10.0.1.208:1019"
I don't know how to debug further, as Cloud Run doesn't offer many options for that. Any hint would be welcome.
I ended up with the solution posted here: changing the firewall rule's source from a network tag to the subnet range apparently solved it. 24 hours now without an error!
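For anyone hitting the same issue, the change amounted to something like this (the rule name allow-nfs-2049 and the CIDR 10.0.1.0/24 are placeholders for my actual values):

```shell
# Before, the rule matched traffic by source network tag, which the
# Cloud Run-managed connections apparently don't always carry.
# Switching the source to the whole subnet range fixed the timeouts.
gcloud compute firewall-rules update allow-nfs-2049 \
    --source-ranges=10.0.1.0/24
```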