Tags: python-3.x, postgresql, kubernetes, istio

Application running in Kubernetes cron job does not connect to database in same Kubernetes cluster


I have a Kubernetes cluster running a PostgreSQL database, a Grafana dashboard, and a single-run Python application (built as a Docker image) that runs hourly inside a Kubernetes CronJob (see manifests below). Additionally, this is all deployed using ArgoCD with Istio sidecar injection.

The issue I'm having (as the title indicates) is that my Python application cannot connect to the database in the cluster. This is very strange to me since the dashboard can connect to the database just fine, so I'm not sure what might be different for the Python app.

Following are my manifests (with a few things changed to remove identifiable information):

Contents of database.yaml:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: database
  name: database
spec:
  replicas: 1
  selector:
    matchLabels:
      app: database
  strategy: {}
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
      - image: postgres:12.5
        imagePullPolicy: ""
        name: database
        ports:
        - containerPort: 5432
        env:
          - name: POSTGRES_DB
            valueFrom:
              secretKeyRef:
                name: postgres-secret
                key: POSTGRES_DB
          - name: POSTGRES_USER
            valueFrom:
              secretKeyRef:
                name: postgres-secret
                key: POSTGRES_USER
          - name: POSTGRES_PASSWORD
            valueFrom:
              secretKeyRef:
                name: postgres-secret
                key: POSTGRES_PASSWORD
        resources: {}
        readinessProbe:
          initialDelaySeconds: 30
          tcpSocket:
            port: 5432
      restartPolicy: Always
      serviceAccountName: ""
      volumes: null
status: {}
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: database
  name: database
spec:
  ports:
  - name: "5432"
    port: 5432
    targetPort: 5432
  selector:
    app: database
status:
  loadBalancer: {}

Contents of dashboard.yaml:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: dashboard
  name: dashboard
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dashboard
  strategy: {}
  template:
    metadata:
      labels:
        app: dashboard
    spec:
      containers:
      - image: grafana:7.3.3
        imagePullPolicy: ""
        name: dashboard
        ports:
          - containerPort: 3000
        resources: {}
        env:
          - name: POSTGRES_DB
            valueFrom:
              secretKeyRef:
                name: postgres-secret
                key: POSTGRES_DB
          - name: POSTGRES_USER
            valueFrom:
              secretKeyRef:
                name: postgres-secret
                key: POSTGRES_USER
          - name: POSTGRES_PASSWORD
            valueFrom:
              secretKeyRef:
                name: postgres-secret
                key: POSTGRES_PASSWORD
        volumeMounts:
          - name: grafana-datasource
            mountPath: /etc/grafana/provisioning/datasources
        readinessProbe:
          initialDelaySeconds: 30
          httpGet:
            path: /
            port: 3000
      restartPolicy: Always
      serviceAccountName: ""
      volumes:
        - name: grafana-datasource
          configMap:
            defaultMode: 420
            name: grafana-datasource
        - name: grafana-dashboard-provision
status: {}
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: dashboard
  name: dashboard
spec:
  ports:
  - name: "3000"
    port: 3000
    targetPort: 3000
  selector:
    app: dashboard
status:
  loadBalancer: {}

Contents of cronjob.yaml:

---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: python
spec:
  concurrencyPolicy: Replace
  # TODO: Go back to hourly when finished testing/troubleshooting
  # schedule: "@hourly"
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - image: python-tool:1.0.5
            imagePullPolicy: ""
            name: python
            args: []
            command:
              - /bin/sh
              - -c
              - >-
                echo "$(POSTGRES_USER)" > creds/db.creds;
                echo "$(POSTGRES_PASSWORD)" >> creds/db.creds;
                echo "$(SERVICE1_TOKEN)" > creds/service1.creds;
                echo "$(SERVICE2_TOKEN)" > creds/service2.creds;
                echo "$(SERVICE3_TOKEN)" > creds/service3.creds;
                python3 -u main.py;
                echo "Job finished with exit code $?";
            env:
              - name: POSTGRES_DB
                valueFrom:
                  secretKeyRef:
                    name: postgres-secret
                    key: POSTGRES_DB
              - name: POSTGRES_USER
                valueFrom:
                  secretKeyRef:
                    name: postgres-secret
                    key: POSTGRES_USER
              - name: POSTGRES_PASSWORD
                valueFrom:
                  secretKeyRef:
                    name: postgres-secret
                    key: POSTGRES_PASSWORD
              - name: SERVICE1_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: api-tokens-secret
                    key: SERVICE1_TOKEN
              - name: SERVICE2_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: api-tokens-secret
                    key: SERVICE2_TOKEN
              - name: SERVICE3_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: api-tokens-secret
                    key: SERVICE3_TOKEN
          restartPolicy: OnFailure
          serviceAccountName: ""
status: {}

Now, as I mentioned, Istio is also part of this picture, so I have a VirtualService for the dashboard since it should be accessible from outside the cluster, but that's it.

With all of that out of the way, here's what I've done to try to solve this myself:

  1. Confirm the CronJob is using the correct connection settings (i.e. host, database name, username, and password) for connecting to the database.

    For this, I added echo statements to the CronJob's command that print the username and password (I know, I know), and they showed the expected values. I also know those are the correct connection settings for the database because I used them verbatim to connect the dashboard to the database, which connected successfully.

    The data source settings for the Grafana dashboard:

    [Image: Connection settings used by the Grafana data source]

    The error message from the Python application (shown in the ArgoCD logs for the container):

    [Image: Connection settings used by the cron job]

  2. Thinking Istio might be causing this problem, I tried disabling Istio sidecar injection for the CronJob resource by adding the annotation sidecar.istio.io/inject: false to the metadata.annotations section, but the annotation never actually showed up in the Argo logs and no change was observed when the CronJob was running. (A sketch of where this annotation would need to go is shown after this list.)

  3. I tried kubectl exec-ing into the CronJob's container running the Python script to debug further, but I was never able to since the container exits as soon as the connection error occurs.
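
For reference on item 2 above, my understanding is that the sidecar.istio.io/inject annotation is evaluated per pod, so for a CronJob it would have to sit on the Job's pod template (with the value quoted as a string) rather than on the CronJob's own metadata. Roughly like this:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: python
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          annotations:
            # quoted so it stays a string rather than a YAML boolean
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - image: python-tool:1.0.5
            name: python
            # ...rest of the container spec unchanged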

That said, I've been banging my head against a wall for long enough on this. Could anyone spot what I might be missing and point me in the right direction, please?


Solution

  • I think the problem is that your pod tries to connect to the database before the Istio sidecar is ready, and thus the connection can't be established.

    Istio runs an init container that configures the pod's route table so that all traffic is routed through the sidecar. So if the sidecar isn't running yet and the application container tries to connect to the db, no connection can be established.

    There are two solutions.

    First, your job could wait for e.g. 30 seconds before calling main.py by prepending a sleep command, as sketched below.
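
    A minimal sketch of that approach, reusing the existing command from cronjob.yaml (the 30 seconds is an arbitrary starting point and may need tuning):

      command:
        - /bin/sh
        - -c
        - >-
          sleep 30;
          echo "$(POSTGRES_USER)" > creds/db.creds;
          echo "$(POSTGRES_PASSWORD)" >> creds/db.creds;
          echo "$(SERVICE1_TOKEN)" > creds/service1.creds;
          echo "$(SERVICE2_TOKEN)" > creds/service2.creds;
          echo "$(SERVICE3_TOKEN)" > creds/service3.creds;
          python3 -u main.py;
          echo "Job finished with exit code $?";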

    Alternatively, you could enable holdApplicationUntilProxyStarts. With this setting, the main container will not start until the sidecar is running.
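
    Depending on your Istio version (the option exists in recent releases, roughly Istio 1.7 and later), it can be turned on just for this workload with an annotation on the Job's pod template, for example:

      jobTemplate:
        spec:
          template:
            metadata:
              annotations:
                proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'

    It can also be enabled mesh-wide through meshConfig.defaultConfig.holdApplicationUntilProxyStarts in the Istio configuration.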