
Vespa: Failed to fetch json: Connection error: socket write error


We have deployed Vespa with Kubernetes on a GKE cluster with 3 nodes. In our Dockerfile we used Vespa 7.351.32 as the base image and added a few more things to it:

  1. GCloud SDK
  2. Some script files that copy our logs to GCS
  3. A workspace folder

The workspace folder contains all the necessary .xml and other files required for the Vespa deployment.
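For context, the package layout is roughly as sketched below (a simplified sketch; the actual folder contains more files):

# Rough layout of /workspace (simplified sketch; actual contents include more files)
# workspace/
# ├── services.xml   # container and content cluster definitions
# ├── hosts.xml      # host aliases for the three pods
# └── ...            # schemas and other configuration
ls -R /workspace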

Below are the steps we execute inside the three pods to deploy the application and restart the config server and services:

/opt/vespa/bin/vespa-deploy prepare /workspace && /opt/vespa/bin/vespa-deploy activate

wait (5 min)

vespa-stop-services
vespa-stop-configserver

wait (15 min)

vespa-start-configserver
vespa-start-services

vespa-get-cluster-state
vespa-config-status
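
As a sanity check after the restart, we can confirm the config server answers on its HTTP port before reading cluster state (a small sketch, assuming the default config server port 19071 that the readiness probe also uses):

# Sketch: confirm the config server is serving again after the restart
# (assumes the default config server HTTP port 19071, same as the readiness probe)
curl -sf http://localhost:19071/ApplicationStatus && echo "config server up"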

Then we receive the following error.

[screenshot: error message]

Below is a screenshot showing connectivity to port 2181 (the ZooKeeper port) on all three pods.

[screenshot: port 2181 connectivity checks on the three pods]
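
The connectivity check was along these lines (a sketch, assuming the pod and headless-service names from the manifest below, and that nc is available in the image):

# Sketch: verify each pod can reach ZooKeeper (2181) on the other config servers.
# Hostnames follow from the StatefulSet name "vespa" and serviceName "vespa-internal"
# in namespace "vespa"; nc is assumed to be installed in the image.
for pod in vespa-0 vespa-1 vespa-2; do
  for target in vespa-0 vespa-1 vespa-2; do
    kubectl -n vespa exec "$pod" -- \
      nc -zv "$target.vespa-internal.vespa.svc.cluster.local" 2181
  done
done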

Upon further inspection of the logs (using vespa-logfmt -l error), we found that the com.yahoo.container.handler.threadpool.DefaultContainerThreadpool bundle fails to load. Manually restarting the config server and Vespa services seems to resolve the issue.

Attaching the related log below.

[screenshot: related error log lines]

Please help us understand the following points:

  1. Does some service need to be running before this bundle is loaded?
  2. Is there a path issue? If so, where can we find this bundle?
  3. Is this because of a memory issue (we have the recommended 4G)?
  4. How does Vespa load these bundles?

Below are the additional details for the setup.

Dockerfile
FROM vespaengine/vespa:7.351.32

# Copy necessary files
RUN mkdir -p /workspace
COPY workspace /workspace
RUN yum install -y python3
COPY backup-pod.sh /

# Downloading gcloud package
RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz

# Installing the package
RUN mkdir -p /usr/local/gcloud \
  && tar -C /usr/local/gcloud -xvf /tmp/google-cloud-sdk.tar.gz \
  && /usr/local/gcloud/google-cloud-sdk/install.sh

# Adding the package path to local
ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin
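
The image is built and pushed to Artifact Registry roughly as follows (a sketch; the tag is the one referenced in the manifest below):

# Sketch: build and push the image used by the StatefulSet (run from the Dockerfile directory)
docker build -t asia-south1-docker.pkg.dev/aurum-projec/vespa/vespa:latest .
docker push asia-south1-docker.pkg.dev/aurum-projec/vespa/vespa:latest
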
Manifest
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vespa
  namespace: vespa
  labels:
    app: vespa
spec:
  replicas: 3
  #serviceName: vespa
  selector:
    matchLabels:
      app: vespa
      name: vespa-internal
  serviceName: vespa-internal
  template:
    metadata:
      labels:
        app: vespa
        name: vespa-internal
    spec:
      serviceAccount: vespa-sa
#     nodeSelector:
#       iam.gke.io/gke-metadata-server-enabled: "true"
      containers:
      - name: vespa
        image: asia-south1-docker.pkg.dev/aurum-projec/vespa/vespa:latest
        imagePullPolicy: Always
        securityContext:
          privileged: true
        ports:
        - containerPort: 8080
          protocol: TCP
        readinessProbe:
          httpGet:
            path: /ApplicationStatus
            port: 19071
            scheme: HTTP
        volumeMounts:
        - name: vespa-var
          mountPath: /opt/vespa/var
        - name: vespa-logs
          mountPath: /opt/vespa/logs
        resources:
          requests:
            memory: "2G"
          limits:
            memory: "2G"
  volumeClaimTemplates:
  - metadata:
      name: vespa-var
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
  - metadata:
      name: vespa-logs
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
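
The manifest is applied in the usual way (a sketch; the file name is illustrative):

# Sketch: apply the StatefulSet and watch the three pods come up
kubectl apply -f vespa-statefulset.yaml
kubectl -n vespa get pods -l app=vespa -w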

Solution

  • That message appears on startup, not on reconfiguration, and relates to one of our bundles which is always present and does consume significant resources on construction, so yes, you are probably running out of memory.

    To be clear, 4 GB isn't the recommendation; it is the minimum you can get away with for trying Vespa out.

    Also note that you don't need this complex, time-consuming process for deploying changes: deploy prepare + activate is sufficient, and it also works without disrupting queries and writes, so you can do it in production.
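
    Concretely, a redeploy then reduces to the prepare/activate command from the steps above, run on one config server pod, and the pod memory should be raised above the 2G currently in the manifest (a sketch; the 6G value is illustrative):

    # Sketch: redeploy an application change; no restarts needed
    /opt/vespa/bin/vespa-deploy prepare /workspace && /opt/vespa/bin/vespa-deploy activate

    # Sketch: raise the memory request/limit (illustrative value; editing the manifest works too)
    kubectl -n vespa patch statefulset vespa --type=json -p='[
      {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "6G"},
      {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "6G"}
    ]'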