amazon-web-services, kubernetes, amazon-eks, large-language-model, ollama

How to install and run Ollama server in AWS Kubernetes cluster (EKS)?


I can install and run the Ollama service with a GPU on an EC2 instance and make API calls to it from a web app in the following way:

First, I need to create a Docker network so that the Ollama service and my web app share the same network:

docker network create my-net

Then I run the official Ollama Docker image to start the service:

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama --net my-net ollama/ollama

Then I need to serve the model (LLM) with Ollama:

docker exec ollama ollama run <model_name> # like llama2, mistral, etc
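
Alternatively, to only download the model into the mounted volume without opening an interactive session, ollama pull can be used (a minor variation on the command above):

docker exec ollama ollama pull <model_name>  # downloads the model weights without starting a chat session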

Then I need to find the IP address of the Ollama container on this Docker network, and export it as the API endpoint URL:

export OLLAMA_API_ENDPOINT=$(docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' ollama)
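
A quick sanity check that the endpoint is reachable (assuming Ollama's default port 11434) is to list the locally available models over its HTTP API:

curl http://${OLLAMA_API_ENDPOINT}:11434/api/tags  # should return a JSON list of pulled models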

Finally, I pass this endpoint URL to my web app, which uses it to make API calls:

docker run -d -p 8080:8080 -e OLLAMA_API_ENDPOINT --rm --name my-web-app --net my-net app

With this, if you go to the following URL:

http://<PUBLIC_IP_OF_THE_EC2_INSTANCE>:8080

You can see the web app (chatbot) running and chat with the LLM through its API.
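
For reference, the kind of request the chatbot sends under the hood is a plain HTTP call to Ollama's generate endpoint, for example (the model name here is just an illustration; the chat endpoint works similarly):

curl http://${OLLAMA_API_ENDPOINT}:11434/api/generate -d '{"model": "mistral", "prompt": "Hello!", "stream": false}'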


Now I want to deploy this app in our AWS Kubernetes cluster (EKS). For that, I wrote the following inference.yaml manifest to run Ollama and serve the LLM:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: ollama-charlie-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/ollama

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-charlie-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-charlie
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-charlie
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ollama-charlie
    spec:
      nodeSelector:
        ollama-charlie-key: ollama-charlie-value
      initContainers:
      - name: download-llm
        image: ollama/ollama
        command: ["ollama", "run", "kristada673/solar-10.7b-instruct-v1.0-uncensored"]
        volumeMounts:
        - name: data
          mountPath: /root/.ollama
      containers:
      - name: ollama-charlie
        image: ollama/ollama
        volumeMounts:
        - name: data
          mountPath: /root/.ollama
        livenessProbe:
          tcpSocket:
            port: 80
          initialDelaySeconds: 120  # Adjust based on your app's startup time
          periodSeconds: 30
          failureThreshold: 2  # Pod is restarted after 2 consecutive failures
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ollama-charlie-pvc
      restartPolicy: Always

---
apiVersion: v1
kind: Service
metadata:
  name: ollama-charlie-service
spec:
  selector:
    app: ollama-charlie
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434

Here, ollama-charlie-key: ollama-charlie-value is the label (key and value) I assigned to the node group I created with a GPU instance type (g4dn.xlarge).
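
A quick way to confirm that the GPU nodes actually carry this label, so the nodeSelector can match them, is:

kubectl get nodes -l ollama-charlie-key=ollama-charlie-value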

But there is a problem: when I run kubectl apply -f inference.yaml, the pod never becomes ready and I get the following error:

Back-off restarting failed container download-llm in pod ollama-charlie-7745b595ff-5ldxt_default(57c6bba9-7d92-4cf8-a4ef-3b19f19023e4)

To diagnose it, I run kubectl logs <pod_name> -c download-llm and get:

Error: could not connect to ollama app, is it running?

This means that the Ollama service is not getting started. Could anyone help me figure out why, and edit the inference.yaml accordingly?

P.S.: Earlier, I tried with the following spec in inference.yaml:

spec:
  initContainers:
  - name: download-llm
    image: ollama/ollama
    command: ["ollama", "run", "kristada673/solar-10.7b-instruct-v1.0-uncensored"]
    volumeMounts:
    - name: data
      mountPath: /root/.ollama
  containers:
  - name: ollama-charlie
    image: ollama/ollama
    volumeMounts:
    - name: data
      mountPath: /root/.ollama
    resources:
      limits:
        nvidia.com/gpu: 1
Here I did not pin the pod to the node group I created, and instead requested a generic Nvidia GPU. That gave me the following error:

[screenshot of the error message; not reproduced here]

That's why I moved to specifying the key-value pair for the node group I created specifically for this deployment, and removed the instruction to use a generic Nvidia GPU.
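
(For completeness, the two approaches are not mutually exclusive; a pod can be pinned to the labelled node group and still request the GPU explicitly. A sketch, assuming the NVIDIA device plugin is running on those nodes:)

spec:
  nodeSelector:
    ollama-charlie-key: ollama-charlie-value
  containers:
  - name: ollama-charlie
    image: ollama/ollama
    resources:
      limits:
        nvidia.com/gpu: 1   # requires the NVIDIA device plugin DaemonSet on the node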


Solution

  • I just went through the same thing while adding support for operating Ollama servers in the KubeAI project. Here is what I found:

    The ollama CLI behaves a little differently when you run it inside a Docker container. You can reproduce that error as follows:

    docker run ollama/ollama:latest run qwen2:0.5b
    Error: could not connect to ollama app, is it running?
    

    When you execute ollama run outside of Docker, it appears to first start an HTTP API and then have the CLI send requests to that API. When you run ollama run inside the Docker container, it assumes the server is already running (hence the "could not connect" part of the error). What you actually want in your case is to just serve that HTTP API, which is what the ollama serve command does. It turns out that serve is the default command specified in the Dockerfile: https://github.com/ollama/ollama/blob/1c70a00f716ed61c5b0a9e0f2a01876de0fc54d0/Dockerfile#L217
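
    You can see the intended behaviour by starting the container without overriding the command; the HTTP API then comes up on its own (the root endpoint simply reports that the server is up):

    docker run -d -p 11434:11434 --name ollama-test ollama/ollama:latest
    curl http://localhost:11434
    # Ollama is running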

    To resolve your error, get rid of the command: override in your Deployment, i.e. drop the download-llm initContainer (its ollama run invocation is what fails). The ollama container will then start with the image's default ollama serve and can serve traffic.
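
    Applied to your manifest, the pod template would look roughly like this (a sketch; note that the liveness probe should target Ollama's default port 11434 rather than 80, otherwise the pod keeps getting restarted even once the server is healthy):

    spec:
      nodeSelector:
        ollama-charlie-key: ollama-charlie-value
      containers:
      - name: ollama-charlie
        image: ollama/ollama          # no command: -> defaults to "ollama serve"
        ports:
        - containerPort: 11434
        volumeMounts:
        - name: data
          mountPath: /root/.ollama
        livenessProbe:
          tcpSocket:
            port: 11434               # Ollama listens on 11434, not 80
          initialDelaySeconds: 120
          periodSeconds: 30
          failureThreshold: 2
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ollama-charlie-pvc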

    The model will be pulled in and served when clients connect to the ollama Deployment (via your k8s Service) - either via a curl command or via running OLLAMA_HOST=<service-name>:<service-port> ollama run <your-model> from another Pod in your cluster.
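
    For example, the initial pull can be triggered from any other Pod in the cluster with either of the following (service name and model taken from your manifest; the curl variant assumes Ollama's /api/pull endpoint):

    OLLAMA_HOST=ollama-charlie-service:11434 ollama run kristada673/solar-10.7b-instruct-v1.0-uncensored
    # or, without the ollama CLI installed:
    curl http://ollama-charlie-service:11434/api/pull -d '{"name": "kristada673/solar-10.7b-instruct-v1.0-uncensored"}'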