Tags: kubeflow, google-cloud-vertex-ai, kubeflow-pipelines

Is it possible to import custom source files into Kubeflow components?


I know that Kubeflow only modifies the container by installing the specified libraries. But I want to use my custom module in the training component of the pipeline.

To clarify my case: I'm deploying a GCP Vertex AI pipeline that consists of preprocessing and training steps. I also have a custom library that I built on top of libraries such as scikit-learn. My main issue is that I want to reuse objects from that library in my training step, which looks like this:

    from kfp.dsl import Dataset, Input, Model, Output, component  # kfp.v2.dsl on KFP SDK 1.8

    @component(
        packages_to_install=[
            "pandas",
            "scikit-learn",  # PyPI name; "sklearn" is a deprecated alias
            "mycustomlibrary?",
        ],
    )
    def train_xgb_model(
        dataset: Input[Dataset],
        model_artifact: Output[Model],
    ):
        # Imports live inside the function because the component body
        # runs in its own container.
        from MyCustomLibrary import XGBClassifier
        import pandas as pd

        data = pd.read_csv(dataset.path)

        model = XGBClassifier(
            objective="binary:logistic"
        )
        model.fit(
            data.drop(columns=["target"]),
            data.target,
        )

        # Score on the training data itself
        score = model.score(
            data.drop(columns=["target"]),
            data.target,
        )

        model_artifact.metadata["train_score"] = float(score)
        model_artifact.metadata["framework"] = "XGBoost"

        model.save_model(model_artifact.path)

Solution

  • One option is to bake your custom module into a custom container image. You can then use that custom image for the component:

    @component(
        base_image='gcr.io/my-custom-image',
        packages_to_install=[
            "pandas",
            "scikit-learn",  # PyPI name; "sklearn" is a deprecated alias
        ],
    )
    def train_xgb_model(...):
        ...
    

    In fact, if you go this route, you might as well bake pandas and scikit-learn into the custom image too, so that nothing has to be installed when the component starts.

    Another option is to host mycustomlibrary somewhere pip can reach, for instance in a GitHub repo, and install it from there:

    @component(
        packages_to_install=[
            "pandas",
            "scikit-learn",  # PyPI name; "sklearn" is a deprecated alias
            "git+https://my-repo/mycustomlibrary.git",
        ],
    )
    def train_xgb_model(...):
        ...
    

    Note that each entry in packages_to_install is passed to the pip install command, and pip can install from various sources, including version control systems. For example: https://packaging.python.org/en/latest/tutorials/installing-packages/#installing-from-vcs
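
Since the entries are ordinary pip requirement strings, a Git-hosted library can also be pinned to a tag, branch, or commit by appending `@<ref>` to the URL, which keeps pipeline runs reproducible. Below is a minimal sketch of building such a string. The helper name `git_requirement` and the tag `v0.1.0` are hypothetical, chosen only for illustration; the repo URL is the placeholder used above.

```python
from typing import Optional


def git_requirement(repo_url: str, ref: Optional[str] = None) -> str:
    """Build a pip requirement string for a package hosted in a Git repo.

    repo_url: HTTPS clone URL of the repository.
    ref:      optional tag, branch, or commit hash to pin the install to.
    """
    requirement = f"git+{repo_url}"
    if ref:
        # pip reads everything after "@" as the Git ref to check out
        requirement += f"@{ref}"
    return requirement


# Hypothetical tag "v0.1.0"; the URL is the placeholder from the answer.
packages_to_install = [
    "pandas",
    "scikit-learn",
    git_requirement("https://my-repo/mycustomlibrary.git", ref="v0.1.0"),
]
print(packages_to_install[-1])
# prints: git+https://my-repo/mycustomlibrary.git@v0.1.0
```

The resulting list can be passed as `packages_to_install` to `@component`. For this to work, the repository has to be an installable Python project (it needs a `setup.py` or `pyproject.toml`), and the component's base image must have `git` available, because pip shells out to it for `git+` URLs.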