
Model serving - tools and components


I am working on a custom platform for managing and running LLM applications that use RAG and LLM models over a user-provided document repository. While planning and designing the solution, I came across several open-source frameworks, such as KFServing, Deep Java Library, and MLflow, that are recommended for use alongside ML pipeline orchestration (Kubeflow) and data pipelines. I want to understand the principles for choosing a framework that can run models with scalable performance, especially on an LLMOps stack, across a variety of use cases such as chat agents, content generation (e.g. emails), and code generation. Any pointers on how to choose a framework for a platform capable of supporting all these Gen-AI application development and hosting scenarios?


Solution

  • Here are some points to consider:

    • Scalability & Performance: Prioritize frameworks that offer robust scalability and optimized performance for LLM inference, with support for both horizontal and vertical scaling. LLM serving is resource-intensive, so you want confidence that the platform won't go down under load.
    • Framework Flexibility & Orchestration: Choose frameworks that allow easy customization for diverse LLM applications and integrate seamlessly with ML orchestration tools like Kubeflow for automated workflows. Also prefer frameworks that support multiple ML libraries (PyTorch, TensorFlow, etc.), as MLflow does.
    • Community Support & Ecosystem: Consider the framework's community size, its ecosystem of extensions, and its integration capabilities with other tools for comprehensive lifecycle management. You don't want to build on a framework that is about to disappear.
    • Cost-Effectiveness & Documentation: Evaluate the total cost of ownership, and prioritize frameworks with solid documentation to ease development and maintenance.
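
To make the scalability point concrete: with KServe (the successor to KFServing), replica bounds and per-replica resources are declared on the InferenceService itself, so Kubernetes handles horizontal scaling for you. This is a minimal sketch, not a production configuration — the service name, storage URI, and resource values are illustrative assumptions.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-chat-service            # hypothetical service name
spec:
  predictor:
    minReplicas: 1                  # keep at least one pod warm
    maxReplicas: 4                  # horizontal scaling ceiling
    model:
      modelFormat:
        name: huggingface           # model format served by the runtime
      storageUri: "s3://my-models/llm"   # hypothetical artifact location
      resources:
        limits:
          nvidia.com/gpu: "1"       # vertical dimension: one GPU per replica
```

The same manifest pattern extends to different use cases (chat agents, email generation, code generation) by swapping the model artifact, which is one reason a declarative serving layer pairs well with Kubeflow-style pipeline orchestration.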