We are running an Anthos cluster on VMware and are having issues pulling container images from the registry.k8s.io registry. We are seeing error messages such as:
Failed to pull image "registry.k8s.io/csi-secrets-store/driver-crds:v1.3.3": rpc error: code = Unknown desc = failed to pull and unpack image "registry.k8s.io/csi-secrets-store/driver-crds:v1.3.3": failed to resolve reference "registry.k8s.io/csi-secrets-store/driver-crds:v1.3.3": failed to do request: Head "https://europe-west2-docker.pkg.dev/v2/k8s-artifacts-prod/images/csi-secrets-store/driver-crds/manifests/v1.3.3": x509: certificate signed by unknown authority
Warning Failed 3s (x4 over 88s) kubelet Error: ErrImagePull
We've checked our firewall rules and nothing is blocking access, so I think it's an issue with the trusted certs in the Anthos node images. We're using the default Ubuntu containerd image, which I believe is based on Ubuntu 18.04. Our Anthos version is 1.14.1-gke.39.
If I try to curl https://europe-west2-docker.pkg.dev/v2/k8s-artifacts-prod/images/csi-secrets-store/driver-crds/manifests/v1.3.3 from the Anthos admin workstation (which also uses a Google-provided OS image), I also get an error: curl failed to verify the legitimacy of the server and therefore could not establish a secure connection to it.
If I do the same from our jumpbox (in the same VLAN), which runs standard Ubuntu 20.04.5, it all works. I can also pull the image on that box.
So, I'm thinking the issue is with the Anthos OS images, and it seems the only solution is for Google to update them with the required certs. I don't think installing them ourselves is very practical.
It also seems we're not the only ones experiencing this issue: https://www.googlecloudcommunity.com/gc/Anthos/Anthos-config-management-operator/m-p/542966#M275
Any suggestions for other things to try or possible solutions welcome!
Check whether the image pull error was caused by a certificate issue: you can try to wipe the kind cluster container and do an upgrade.
Check whether there is a mismatch in the files you uploaded. Check that the registry server certificate is the same in all of the files/objects below, and also check the year it was issued:
a. Checkpoint yaml
b. Onpremadmincluster CR object
c. Openssl output
Check whether the certificate in the registry_ca.crt file is different from the above and was issued in a different year. The openssl command shows the certificate that is actually being used, so if it differs, assume the registry_ca.crt file is incorrect.
Regarding the checkpoint file, please correct the registry_ca.crt file with the current certificate and then try to upgrade.
Note: Please don't retry the upgrade after modifying the checkpoint. I suspect the issue happened because there is an inconsistency between the source of truth and the actual repo certificate.
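The certificate comparison described above can be sketched with standard openssl commands. This is a minimal sketch, not an official procedure: the function names (cert_summary, fetch_served_cert, same_cert) and the output file name served.crt are made up for illustration, and it assumes registry_ca.crt holds the registry server certificate itself, as the steps above suggest.

```shell
# Print the SHA-256 fingerprint and the notBefore/notAfter dates of a PEM certificate.
cert_summary() {
  openssl x509 -in "$1" -noout -fingerprint -sha256 -dates
}

# Save the first certificate presented by host:port ($1) to a PEM file ($2).
fetch_served_cert() {
  openssl s_client -connect "$1" -showcerts </dev/null 2>/dev/null \
    | openssl x509 -outform PEM > "$2"
}

# Return 0 if both PEM files hold the same certificate, 1 if they differ,
# and 2 if either file cannot be parsed.
same_cert() {
  fp_a="$(openssl x509 -in "$1" -noout -fingerprint -sha256 2>/dev/null)" || return 2
  fp_b="$(openssl x509 -in "$2" -noout -fingerprint -sha256 2>/dev/null)" || return 2
  [ "$fp_a" = "$fp_b" ]
}
```

For example, on the admin workstation you could run fetch_served_cert kul-tools-005.kul.uc.int:5000 served.crt, then cert_summary served.crt and cert_summary registry_ca.crt to compare the issue dates, and same_cert served.crt registry_ca.crt to check whether they match. If registry_ca.crt actually contains the issuing CA rather than the server certificate, the fingerprints will differ even when everything is correct, so treat a mismatch as a prompt to look closer rather than as proof.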
To investigate further and clear your issue, you need to collect more information about the repo and cluster setup:
a. The output of "openssl s_client -showcerts -connect kul-tools-005.kul.uc.int:5000 </dev/null"; you can run this command on the admin workstation
b. "privateRegistry.caCertPath" in the admin-cluster yaml
Also refer to the official GCP documentation on troubleshooting Anthos clusters on VMware authentication issues (Refresh token expired):
The following issue occurs when the refresh token in the kubeconfig file has expired:
Unable to connect to the server: Get {DISCOVERY_ENDPOINT}: x509: certificate signed by unknown authority
To resolve this issue, run the gcloud anthos auth login command again.
EDIT:
Sometimes this is isolated to the ca-certificates bundle on your nodes. Check whether you have made a change to the firewall and added a GRE tunnel that requires a cert in order to work. If so, try the following:
Try to add the certificate to the ca-certificates bundle by following the steps here.
Try to add the cert directly to the /etc/ssl/certs directory
Try recycling all of the certificates
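As a rough illustration of the first option on an Ubuntu-based image: the usual mechanism is to copy the CA certificate into /usr/local/share/ca-certificates with a .crt extension and run update-ca-certificates. The sketch below wraps that in a function; install_ca, CA_DIR and UPDATE_CMD are hypothetical names introduced here so the function can be dry-run, and it assumes root access on the node.

```shell
# Install a PEM CA certificate into the Ubuntu trust store.
# CA_DIR and UPDATE_CMD are illustrative overrides for dry runs; the defaults
# are the standard Ubuntu location and refresh command.
install_ca() {
  cert="$1"
  ca_dir="${CA_DIR:-/usr/local/share/ca-certificates}"

  # Refuse files that do not look like a PEM certificate.
  if ! grep -q -- '-----BEGIN CERTIFICATE-----' "$cert"; then
    echo "not a PEM certificate: $cert" >&2
    return 1
  fi

  # update-ca-certificates only picks up files ending in .crt.
  name="$(basename "$cert")"
  name="${name%.*}.crt"

  cp "$cert" "$ca_dir/$name" && ${UPDATE_CMD:-update-ca-certificates}
}
```

On a node this would be run as root, for example install_ca firewall-ca.pem, where firewall-ca.pem is a placeholder file name. containerd will most likely need a restart (sudo systemctl restart containerd) before it trusts the new CA, and manual changes like this may be lost when nodes are re-created, for instance during an upgrade.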
(During your testing, you may see curl requests over http (no SSL) blocked at the firewall level. At that point it is safe to say that this is not really an Anthos issue but rather a networking/OS configuration issue.)
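One way to tell a networking issue from a trust-store issue is to look at who issued the certificate each machine actually receives. The following is only a sketch; served_issuer and file_issuer are illustrative names, and the default port of 443 is an assumption.

```shell
# Print the subject and issuer of the certificate presented by a server.
# $1 is the host name, $2 the port (defaults to 443).
served_issuer() {
  host="$1"
  port="${2:-443}"
  openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -subject -issuer
}

# Print the subject and issuer of a certificate stored in a local PEM file.
file_issuer() {
  openssl x509 -in "$1" -noout -subject -issuer
}
```

Running served_issuer europe-west2-docker.pkg.dev on both the jumpbox and the admin workstation gives a quick comparison. If the issuers differ, something on the network path (such as the firewall) is probably re-signing the traffic and its CA needs to be trusted on the nodes. If they are identical, the difference is more likely in the CA bundle shipped with the OS image.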