Search code examples
kubernetesgoogle-cloud-platformgoogle-kubernetes-enginekubernetes-networkpolicycilium

GKE Dataplane v2 NetworkPolicies not working


I am currently trying to move my calico based clusters to the new Dataplane V2, which is basically a managed Cilium offering. For local testing, I am running k3d with open source cilium installed, and created a set of NetworkPolicies (k8s native ones, not CiliumPolicies), which lock down the desired namespaces.

My current issue is, that when porting the same Policies on a GKE cluster (with DataPlane enabled), those same policies don't work.

As an example let's take a look into the connection between some app and a database:

---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: db-server.db-client
  namespace: BAR
spec:
  podSelector:
    matchLabels:
      policy.ory.sh/db: server
  policyTypes:
    - Ingress
  ingress:
    - ports: []
      from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: FOO
          podSelector:
            matchLabels:
              policy.ory.sh/db: client
---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: db-client.db-server
  namespace: FOO
spec:
  podSelector:
    matchLabels:
      policy.ory.sh/db: client
  policyTypes:
    - Egress
  egress:
    - ports:
        - port: 26257
          protocol: TCP
      to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: BAR
          podSelector:
            matchLabels:
              policy.ory.sh/db: server

Moreover, using GCP monitoring tools we can see the expected and actual effect the policies have on connectivity:

Expected: Expected

Actual: Actual

And logs from the application trying to connect to the DB, and getting denied:

{
  "insertId": "FOO",
  "jsonPayload": {
    "count": 3,
    "connection": {
      "dest_port": 26257,
      "src_port": 44506,
      "dest_ip": "172.19.0.19",
      "src_ip": "172.19.1.85",
      "protocol": "tcp",
      "direction": "egress"
    },
    "disposition": "deny",
    "node_name": "FOO",
    "src": {
      "pod_name": "backoffice-automigrate-hwmhv",
      "workload_kind": "Job",
      "pod_namespace": "FOO",
      "namespace": "FOO",
      "workload_name": "backoffice-automigrate"
    },
    "dest": {
      "namespace": "FOO",
      "pod_namespace": "FOO",
      "pod_name": "cockroachdb-0"
    }
  },
  "resource": {
    "type": "k8s_node",
    "labels": {
      "project_id": "FOO",
      "node_name": "FOO",
      "location": "FOO",
      "cluster_name": "FOO"
    }
  },
  "timestamp": "FOO",
  "logName": "projects/FOO/logs/policy-action",
  "receiveTimestamp": "FOO"
}

EDIT:

My local env is a k3d cluster created via:

k3d cluster create --image ${K3SIMAGE} --registry-use k3d-localhost -p "9090:30080@server:0" \
            -p "9091:30443@server:0" foobar \
            --k3s-arg=--kube-apiserver-arg="enable-admission-plugins=PodSecurityPolicy,NodeRestriction,ServiceAccount@server:0" \
            --k3s-arg="--disable=traefik@server:0" \
            --k3s-arg="--disable-network-policy@server:0" \
            --k3s-arg="--flannel-backend=none@server:0" \
            --k3s-arg=feature-gates="NamespaceDefaultLabelName=true@server:0"

docker exec k3d-server-0 sh -c "mount bpffs /sys/fs/bpf -t bpf && mount --make-shared /sys/fs/bpf"
kubectl taint nodes k3d-ory-cloud-server-0 node.cilium.io/agent-not-ready=true:NoSchedule --overwrite=true
skaffold run --cache-artifacts=true -p cilium --skip-tests=true --status-check=false
docker exec k3d-server-0 sh -c "mount --make-shared /run/cilium/cgroupv2"

Where cilium itself is being installed by skaffold, via helm with the following parameters:

name: cilium
remoteChart: cilium/cilium
namespace: kube-system
version: 1.11.0
upgradeOnChange: true
wait: false
setValues:
  externalIPs.enabled: true
  nodePort.enabled: true
  hostPort.enabled: true
  hubble.relay.enabled: true
  hubble.ui.enabled: true

UPDATE: I have setup a third environment: a GKE cluster using the old calico CNI (Legacy dataplane) and installed cilium manually as shown here. Cilium is working fine, even hubble is working out of the box (unlike with the dataplane v2...) and I found something interesting. The rules behave the same as with the GKE managed cilium, but with hubble working I was able to see this:

Hubble db connection

For some reason cilium/hubble cannot identify the db pod and decipher its labels. And since the labels don't work, the policies that rely on those labels, also don't work.

Another proof of this would be the trace log from hubble:

kratos -> db

Here the destination app is only identified via an IP, and not labels.

The question now is why is this happening?

Any idea how to debug this problem? What could be difference coming from? Do the policies need some tuning for the managed Cilium, or is a bug in GKE? Any help/feedback/suggestion appreciated!


Solution

  • Update: I was able to solve the mystery and it was ArgoCD all along. Cilium is creating an Endpoint and Identity for each object in the namespace, and Argo was deleting them after deploying the applications.

    For anyone who stumbles on this, the solution is to add this exclusion to ArgoCD:

      resource.exclusions: |
        - apiGroups:
          - cilium.io
          kinds:
          - CiliumIdentity
          - CiliumEndpoint
          clusters:
          - "*"