I have been reading about Kubeflow, and there are two ways to create components.
But there isn't an explanation of why I should use one or the other. For example, to load a container-based component I need to build a Docker image, push it, and load the YAML with the component specification into the pipeline, but with a function-based component I only need to import the function.
And in order to apply CI/CD with the latest version: if I have container-based components, I can keep a repo with all the YAML files and load them with load_component_from_url, but if they are functions, I can also keep a repo with all of them and load it as a package.
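To make it concrete, this is roughly what I mean by the two approaches (the URL, image, and function names below are just examples, not from a real repo):

```python
from kfp import components

# Container-based: build & push a Docker image, then load the component
# from its published YAML specification (example URL).
train_op = components.load_component_from_url(
    "https://raw.githubusercontent.com/my-org/kfp-components/main/train/component.yaml"
)

# Function-based: just define (or import) a Python function and wrap it.
def preprocess(text: str) -> str:
    return text.strip()

preprocess_op = components.create_component_from_func(
    preprocess, base_image="python:3.9"
)
```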
So which do you think is the better approach: container-based or function-based?
Thanks.
The short answer is: it depends. The more nuanced answer is that it depends on what you want to do with the component.
As background: when a KFP pipeline is compiled, it's actually a series of different YAMLs that are launched by Argo Workflows. All of these need to be container-based to run on Kubernetes, even if the container itself only runs Python.
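For example, compiling even a trivial pipeline produces an Argo Workflow spec where every step is a container. A minimal sketch with the KFP v1 SDK (pipeline and step names are placeholders):

```python
import kfp
from kfp import dsl

@dsl.pipeline(name="hello-pipeline")
def hello_pipeline():
    # Even this "pure Python" step compiles down to a container spec.
    dsl.ContainerOp(
        name="say-hello",
        image="python:3.9",
        command=["python", "-c", "print('hello')"],
    )

# The output is an Argo Workflow YAML; Argo launches one pod per step.
kfp.compiler.Compiler().compile(hello_pipeline, "hello_pipeline.yaml")
```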
A function-based component (function to Python ContainerOp) is a quick way to get started with Kubeflow Pipelines. It was designed to model Airflow's Python-native DSL. It takes your Python function and runs it inside a defined Python container. You're right that it's easier to encapsulate all your work within the same Git folder. This setup is great for teams that are just getting started with KFP and don't mind some boilerplate to get going quickly.
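Here's a sketch of that flow, assuming the KFP v1 SDK (the function, image, and pipeline names are just examples):

```python
from kfp import dsl
from kfp.components import func_to_container_op

def add(a: float, b: float) -> float:
    """Plain Python; KFP serializes this body into a container step."""
    return a + b

# Wrap the function so it runs inside a Python base image (no Dockerfile needed).
add_op = func_to_container_op(add, base_image="python:3.9")

@dsl.pipeline(name="add-pipeline")
def add_pipeline(a: float = 1.0, b: float = 2.0):
    first = add_op(a, b)
    # Downstream steps consume the output like any other component output.
    add_op(first.output, 3.0)
```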
Components really become powerful when your team needs to share work, or when you have an enterprise ML platform that provides templated logic for how to run specific jobs in a pipeline. Components can be versioned separately and built to be used on any of your clusters in the same way (the underlying container should be stored in Docker Hub or, if you're on AWS, ECR). Inputs/outputs prescribe how a run will execute using the component. You can imagine a team at Uber might use KFP to pull data on the number of drivers in a certain zone. The inputs to the component could be a geo-coordinate box and the time of day for which to load the data. The component saves the data to S3, which is then loaded into your model for training. Without the component, there would be quite a bit of boilerplate, and the code would need to be copied across multiple pipelines and users.
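For illustration, the Uber-style example above might look like this from the consumer's side (the URL, component input names, and output name are hypothetical; in practice they come from the shared component.yaml spec your platform team publishes):

```python
from kfp import dsl, components

# Hypothetical shared component: the YAML spec and its image are versioned
# and published separately by the platform team (the URL is made up).
pull_drivers_op = components.load_component_from_url(
    "https://raw.githubusercontent.com/my-org/kfp-components/v1.2.0/pull_drivers/component.yaml"
)

@dsl.pipeline(name="driver-demand-training")
def training_pipeline(geo_box: str, time_of_day: str):
    # The inputs/outputs are prescribed by the component spec, so every
    # team calls it the same way; the component writes its data to S3.
    pull_task = pull_drivers_op(geo_box=geo_box, time_of_day=time_of_day)
    # A downstream training step would consume pull_task.outputs["data_path"].
```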
I'm a former PM at AWS for SageMaker and open-source ML integrations, and I'm sharing this from my experience looking at enterprise setups.