Tags: tensorflow, kubernetes, scale, tensorflow-serving

Kubernetes + TF Serving - how to serve hundreds of ML models without keeping hundreds of idle pods running?


I have hundreds of models, organized by category, project, etc. Some of the models are heavily used, while others are accessed only occasionally. How can I trigger a scale-up operation only when needed (for the infrequently used models), instead of running hundreds of pods serving hundreds of models while most of them sit idle? That is a huge waste of computing resources.


Solution

  • What you are trying to do is scale deployments to zero when they are not in use.

    K8s does not provide such functionality out of the box.

    You can achieve it with the Knative Pod Autoscaler. Knative is probably the most mature solution available at the time of writing.

    There are also more experimental solutions, such as osiris or zero-pod-autoscaler, that you may find interesting and that could be a good fit for your use case.
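    As a sketch of the Knative approach: each model can be wrapped in its own Knative Service, which the Knative Pod Autoscaler scales to zero when no requests arrive. The service name, container image, and model path below are placeholders, not values from the question.

    ```yaml
    # Hypothetical Knative Service wrapping a TF Serving container.
    # With min-scale set to "0" (the KPA default), Knative removes all
    # pods for this model after an idle period and cold-starts one
    # when the next request arrives.
    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: my-model          # placeholder name
    spec:
      template:
        metadata:
          annotations:
            # Allow scaling down to zero idle pods.
            autoscaling.knative.dev/min-scale: "0"
        spec:
          containers:
            - image: tensorflow/serving
              args:
                - "--model_name=my_model"                    # placeholder
                - "--model_base_path=gs://my-bucket/my_model" # placeholder
              ports:
                - containerPort: 8501  # TF Serving REST port
    ```

    The trade-off is cold-start latency: the first request to an idle model waits for a pod to start and the model to load, so this pattern fits the rarely used models better than the heavily used ones.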