Palantir Foundry model adapter - only returning 22 results no matter what the input is


I am trying to learn modeling in Foundry using Foundry Model Adapters. I'm also brand new to machine learning, so I did a very simple tutorial in Multilabel Classification with Scikit-Learn that I am trying to work into the Foundry model adapter setup. I'll note that in our Foundry stack model adapters are still marked as Beta, and for now I'm more concerned with getting this running in Foundry than with testing how accurate the model is.

I published a staging model in Modeling Objectives. Whenever I run the model inference transform it only returns 22 results when I am expecting over 3000 based on the input dataset. My expectation is that when I input a dataset of 'uncategorized' records for the model to run on, that it would run on all of those rows. Since I'm very new to all of this I am probably missing something very fundamental. The model inference just returns the same 22 rows no matter what I try as an input dataset. Everything runs as expected in a test Code Workspace with the same code which makes me wonder if I have missed something in the adapter.

Input/training dataset: PubMed MultiLabel Text Classification Dataset MeSH

Model training code (this is the model published to Modeling Objectives). Modified from the tutorial to use a sklearn pipeline since this seems to be the only way to make it work with the model adapter:

from transforms.api import transform, Input, Output
from palantir_models.transforms import ModelOutput
from palantir_models.models import ModelVersionChangeType
from main.model_adapters.adapter import ExampleModelAdapter
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression


@transform(
    features_and_labels_input=Input("/Modeling Tutorials/PubMed Multi Label Text Classification Dataset"),
    model_output=ModelOutput("/Modeling Tutorials/multi label classifier/multi_label_classifier"),
)
def compute(features_and_labels_input, model_output):
    df = features_and_labels_input.pandas()

    df = df.drop(['Title', 'meshMajor', 'pmid', 'meshid', 'meshroot', 'Z', 'V', 'N'], axis=1)
    X = df["abstractText"]
    y = np.asarray(df[df.columns[1:]])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    clf = MultiOutputClassifier(LogisticRegression())
    pipe = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', clf)
    ])

    model = pipe.fit(X_train, y_train)

    # Wrap the trained model in a ModelAdapter
    foundry_model = ExampleModelAdapter(model)

    # Publish and write the trained model to Foundry
    model_output.publish(
        model_adapter=foundry_model,
        change_type=ModelVersionChangeType.MINOR   # How to increment model version number: MAJOR (x.0.0), MINOR(0.x.0), PATCH (0.0.x)
    )

Sklearn adapter template that ships with model adapter code repositories. Nothing really changed here from the provided template except the columns in the api input/output:

from palantir_models.models import ModelAdapter, PythonEnvironment, CondaDependency
from palantir_models.models.api import ModelApi, ModelApiColumn, ModelInput, ModelOutput, DFType
from palantir_models.models._types import CondaVersionExact
import pickle
import os
import pandas as pd


class ExampleModelAdapter(ModelAdapter):
    MODEL_SAVE_LOCATION = 'model.pkl'
    METADATA_SAVE_LOCATION = 'metadata.pkl'
    PREDICTION_COLUMN_NAME_KEY = 'prediction_column_name'

    model = None
    prediction_column_name = None

    def __init__(self, model, prediction_column_name='prediction'):
        self.model = model
        self.prediction_column_name = prediction_column_name

    @classmethod
    def load(cls, state_reader, container_context):
        with state_reader.extract_to_temp_dir() as tmp_dir:
            model = pickle.load(open(os.path.join(tmp_dir, ExampleModelAdapter.MODEL_SAVE_LOCATION), "rb"))
            metadata = pickle.load(open(os.path.join(tmp_dir, ExampleModelAdapter.METADATA_SAVE_LOCATION), "rb"))
            prediction_column_name = metadata[ExampleModelAdapter.PREDICTION_COLUMN_NAME_KEY]
        return cls(model, prediction_column_name)

    def save(self, state_writer):
        with state_writer.open(ExampleModelAdapter.MODEL_SAVE_LOCATION, "wb") as model_file:
            pickle.dump(self.model, model_file)
        with state_writer.open(ExampleModelAdapter.METADATA_SAVE_LOCATION, "wb") as metadata_file:
            metadata = {
                ExampleModelAdapter.PREDICTION_COLUMN_NAME_KEY: self.prediction_column_name
            }
            pickle.dump(metadata, metadata_file)

    @classmethod
    def api(cls):
        inputs = [
            ModelInput.Tabular(name="input_df",
                           df_type=DFType.PANDAS,
                           columns=[ModelApiColumn(name="features", type=tuple)])
        ]
        outputs = [
            ModelOutput.Tabular(name="output_df",
                            columns=[ModelApiColumn(name="prediction", type=list)])
        ]
        return ModelApi(inputs, outputs)

    def run_inference(self, inputs, outputs):
        df_in = inputs.input_df
        df_out = outputs.output_df
        df = pd.DataFrame(self.model.predict(df_in))
        df_out.write(df)

    @classmethod
    def dependencies(cls):
        # DO NOT MODIFY THIS FUNCTION DEFINITION.
        # Copy this code into all model adapters published from this repo.
        # Dependencies should be added to /transforms-model-training/conda_recipe/meta.yaml
        from main._version import __version__ as generated_version_tag
        return PythonEnvironment(
            conda_dependencies=[
                CondaDependency(
                    "transforms-model-training-ri.stemma.main.repository.91e66421-692b-4338-84cb-27c6f1a1e785",
                    CondaVersionExact(version=f"{generated_version_tag}"),
                    "ri.stemma.main.repository.91e66421-692b-4338-84cb-27c6f1a1e785")
            ]
        )

Model inference transform:

from transforms.api import transform, Input, Output
from palantir_models.transforms import ModelInput


@transform(
    model=ModelInput("/Modeling Tutorials/multi label classifier/multi_label_classifier"),
    inference_input=Input("/Modeling Tutorials/PubMed Multi Label Text Classification Dataset"),
    output=Output("/Modeling Tutorials/inference_output"),
)
def compute(inference_input, model, output):
    inference_results = model.transform(input_df=inference_input)   # 1. Call ModelAdapter.transform with the inputs specified in ModelAdapter.api
    df_out = inference_results.output_df                            # 2. Collect the desired output from the named tuple of inference result outputs
    output.write_pandas(df_out)

Output of model inference transform: it returns only 22 rows, where I am expecting 3000.


Solution

  • Your model adapter logic and your training logic both look great! The reason you're only getting 22 rows is this line:

    df = pd.DataFrame(self.model.predict(df_in))
    

    Your model was trained on one column, abstractText, but when you run inference you pass in the full pandas DataFrame, and the model doesn't know how to work with that. Iterating over a DataFrame yields its column names rather than its rows, so the first step of your pipeline, the TfidfVectorizer, ends up treating each column in df_in as a record to process. As you have 22 columns in that DataFrame, you end up with 22 output rows.

    You can fix this by generating predictions just on your abstractText column instead.

    df = pd.DataFrame(self.model.predict(df_in['abstractText']))
    

    Now df will be a DataFrame with the correct number of rows and one column for each prediction class. You can optionally name those columns too:

    def run_inference(self, inputs, outputs):
        df_in = inputs.input_df
        df_out = outputs.output_df
    
        df = pd.DataFrame(
            self.model.predict(df_in['abstractText']),
            columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M']
        )
    
        df_out.write(df)
    

    Hopefully that helps!
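
If you want to see the row-count behaviour outside Foundry, here is a minimal sketch using only pandas and scikit-learn. The `toy` DataFrame, its six abstracts, and the two label columns are made up for illustration; the pipeline mirrors the one in the question. It fits on the text column, then compares predicting on the whole DataFrame against predicting on the text column alone.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the PubMed data: one text column, two label columns.
toy = pd.DataFrame({
    "abstractText": [
        "heart disease in adults",
        "lung cancer screening",
        "heart surgery outcomes",
        "cancer cell growth",
        "adult heart failure",
        "tumor growth in lung tissue",
    ],
    "A": [1, 0, 1, 0, 1, 0],
    "B": [0, 1, 0, 1, 0, 1],
})

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(LogisticRegression())),
])
pipe.fit(toy["abstractText"], toy[["A", "B"]].to_numpy())

# Iterating over a DataFrame yields its column names, not its rows,
# so these are the "documents" the vectorizer sees for pipe.predict(toy).
docs_seen_by_vectorizer = list(toy)

# Whole DataFrame: one prediction row per column.
preds_whole_frame = pipe.predict(toy)

# Text column only: one prediction row per input record.
preds_text_column = pipe.predict(toy["abstractText"])

print(docs_seen_by_vectorizer)
print(preds_whole_frame.shape, preds_text_column.shape)
```

With three columns in `toy`, the whole-frame call produces three prediction rows, which is the same effect as the 22 rows in the question; selecting the text column first gives one row per record.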