I am pretty new to Kubernetes, so apologies if my questions seem vague; I will try to elaborate as much as possible. I have a pod on Google Cloud, managed via Kubernetes, that has a GPU in it. This GPU is responsible for processing one set of tasks, let's say classifying images. To expose it, I created a Kubernetes Service; the relevant section of my yaml file looks like the following. The URL for this Service is http://model-server-service.default.svc.cluster.local, since the name of the Service is model-server-service.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: model-server
  name: model-server
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-server
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - args:
        - -t
        - "120"
        - -b
        - "0.0.0.0"
        - app:flask_app
        command:
        - gunicorn
        env:
        - name: ENV
          value: staging
        - name: GCP
          value: "2"
        image: gcr.io/my-production/my-model-server:myGitHash
        imagePullPolicy: Always
        name: model-server
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000
          protocol: TCP
        volumeMounts:
        - name: model-files
          mountPath: /model-server/models
      # These containers are run during pod initialization
      initContainers:
      - name: model-download
        image: gcr.io/my-production/my-model-server:myGitHash
        command:
        - gsutil
        - cp
        - -r
        - gs://my-staging-models/*
        - /model-files/
        volumeMounts:
        - name: model-files
          mountPath: "/model-files"
      volumes:
      - name: model-files
        emptyDir: {}
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsUser: 0
      terminationGracePeriodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: model-server
  name: model-server-service
  namespace: default
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8000
  selector:
    app: model-server
  sessionAffinity: None
  type: ClusterIP
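As a quick sanity check (not part of the manifest), the Service URL can be exercised from a throwaway pod; this assumes the endpoint speaks plain HTTP and uses the public curlimages/curl image as a test client:

# Run a one-off pod and curl the ClusterIP service by its DNS name
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://model-server-service.default.svc.cluster.local/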
Here is where my question begins. I am creating a new set of tasks. These new tasks will need extensive memory, so I do not want to run them through the previous service; I would like to handle them as part of a separate new service, with the following URL: http://model-server-heavy-service.default.svc.cluster.local. I tried to create a new yaml file, model-server-heavy.yaml. In this new yaml file, I changed the name of the Service from model-server-service to model-server-heavy-service. I also changed the app label and the name from model-server to model-server-heavy. The final yaml file looks like what I put at the end of this post. Unfortunately, the new model server does not work, and I get the following status for it on Kubernetes:
model-server-asdhjs-asd    1/1   Running                0   21m
model-server-heavy-xnshk   0/1   **CrashLoopBackOff**   8   21m
Can someone please shed some light on what I am doing wrong, and what the alternative would be for what I have in mind? Why do I get CrashLoopBackOff for the second model server, and what is it that I am not doing correctly?
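In case it helps with the diagnosis, the crash can be inspected with kubectl describe and kubectl logs; the pod name below is taken from the listing above:

# Pod events often show why the container keeps restarting
kubectl describe pod model-server-heavy-xnshk

# Logs from the current attempt, and from the last crashed run
kubectl logs model-server-heavy-xnshk
kubectl logs model-server-heavy-xnshk --previous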
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: model-server-heavy
  name: model-server-heavy
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-server-heavy
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: model-server-heavy
    spec:
      containers:
      - args:
        - -t
        - "120"
        - -b
        - "0.0.0.0"
        - app:flask_app
        command:
        - gunicorn
        env:
        - name: ENV
          value: staging
        - name: GCP
          value: "2"
        image: gcr.io/my-production/my-model-server:myGitHash
        imagePullPolicy: Always
        name: model-server-heavy
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000
          protocol: TCP
        volumeMounts:
        - name: model-files
          mountPath: /model-server-heavy/models
      # These containers are run during pod initialization
      initContainers:
      - name: model-download
        image: gcr.io/my-production/my-model-server:myGitHash
        command:
        - gsutil
        - cp
        - -r
        - gs://my-staging-models/*
        - /model-files/
        volumeMounts:
        - name: model-files
          mountPath: "/model-files"
      volumes:
      - name: model-files
        emptyDir: {}
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsUser: 0
      terminationGracePeriodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: model-server-heavy
  name: model-server-heavy-service
  namespace: default
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8000
  selector:
    app: model-server-heavy
  sessionAffinity: None
  type: ClusterIP
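One more thing worth noting: since the whole point of this second deployment is heavy memory use, the container should presumably also declare explicit memory requests and limits, which the manifest above does not do. A minimal sketch of that resources block follows; the sizes are placeholders, not measured values:

        resources:
          requests:
            memory: "8Gi"     # placeholder; size to the actual workload
          limits:
            memory: "16Gi"    # placeholder upper bound
            nvidia.com/gpu: 1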
Thanks to @dawid-kruk and @patrick-w, I found I had to make two modifications in model-server-heavy.yaml in order for it to work:

1. Change the mountPath from /model-server-heavy/models to /model-server/models.
2. In line 38 of the model-server-heavy.yaml file, change the name from model-server-heavy to model-server.
I first tried to fix the problem by applying item 1 alone, but it didn't work. Then I applied item 2 as well, and that fixed it; both changes need to be in place for the server to work. I understand why I had to make the change in item 1, but I am not sure about item 2.
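For reference, after both fixes the relevant container fields in model-server-heavy.yaml look like the sketch below (everything else unchanged); the comment on the mount path is my own reading of why item 1 matters:

        name: model-server            # item 2: reverted from model-server-heavy
        volumeMounts:
        - name: model-files
          # item 1: the image still serves its models from /model-server/models,
          # so the mount path has to match that directory, not the new app name
          mountPath: /model-server/models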