I am trying to learn modeling in Foundry using Foundry Model Adapters. I'm also brand new to machine learning, so I did a very simple tutorial in Multilabel Classification with Scikit-Learn that I am trying to work into the Foundry model adapter setup. I'll note that in our Foundry stack model adapters are still marked as Beta, and I'm more concerned about getting this running in the Foundry realm than I have been with testing how accurate the model is for now.
I published a staging model in Modeling Objectives. Whenever I run the model inference transform, it returns only 22 results when I am expecting over 3000 based on the input dataset. My expectation is that when I input a dataset of 'uncategorized' records, the model would run on all of those rows. Since I'm very new to all of this, I am probably missing something very fundamental. The model inference returns the same 22 rows no matter what I try as an input dataset. Everything runs as expected in a test Code Workspace with the same code, which makes me wonder if I have missed something in the adapter.
Input/training dataset: PubMed MultiLabel Text Classification Dataset MeSH
Model training code (this is the model published to Modeling Objectives). Modified from the tutorial to use a sklearn pipeline since this seems to be the only way to make it work with the model adapter:
from transforms.api import transform, Input, Output
from palantir_models.transforms import ModelOutput
from palantir_models.models import ModelVersionChangeType
from main.model_adapters.adapter import ExampleModelAdapter
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

@transform(
    features_and_labels_input=Input("/Modeling Tutorials/PubMed Multi Label Text Classification Dataset"),
    model_output=ModelOutput("/Modeling Tutorials/multi label classifier/multi_label_classifier"),
)
def compute(features_and_labels_input, model_output):
    df = features_and_labels_input.pandas()
    df = df.drop(['Title', 'meshMajor', 'pmid', 'meshid', 'meshroot', 'Z', 'V', 'N'], axis=1)
    X = df["abstractText"]
    y = np.asarray(df[df.columns[1:]])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    clf = MultiOutputClassifier(LogisticRegression())
    pipe = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', clf)
    ])
    model = pipe.fit(X_train, y_train)
    # Wrap the trained model in a ModelAdapter
    foundry_model = ExampleModelAdapter(model)
    # Publish and write the trained model to Foundry
    model_output.publish(
        model_adapter=foundry_model,
        change_type=ModelVersionChangeType.MINOR  # How to increment model version number: MAJOR (x.0.0), MINOR(0.x.0), PATCH (0.0.x)
    )
Sklearn adapter template that ships with model adapter code repositories. Nothing really changed here from the provided template except the columns in the api input/output:
from palantir_models.models import ModelAdapter, PythonEnvironment, CondaDependency
from palantir_models.models.api import ModelApi, ModelApiColumn, ModelInput, ModelOutput, DFType
from palantir_models.models._types import CondaVersionExact
import pickle
import os
import pandas as pd

class ExampleModelAdapter(ModelAdapter):
    MODEL_SAVE_LOCATION = 'model.pkl'
    METADATA_SAVE_LOCATION = 'metadata.pkl'
    PREDICTION_COLUMN_NAME_KEY = 'prediction_column_name'
    model = None
    prediction_column_name = None

    def __init__(self, model, prediction_column_name='prediction'):
        self.model = model
        self.prediction_column_name = prediction_column_name

    @classmethod
    def load(cls, state_reader, container_context):
        with state_reader.extract_to_temp_dir() as tmp_dir:
            model = pickle.load(open(os.path.join(tmp_dir, ExampleModelAdapter.MODEL_SAVE_LOCATION), "rb"))
            metadata = pickle.load(open(os.path.join(tmp_dir, ExampleModelAdapter.METADATA_SAVE_LOCATION), "rb"))
        prediction_column_name = metadata[ExampleModelAdapter.PREDICTION_COLUMN_NAME_KEY]
        return cls(model, prediction_column_name)

    def save(self, state_writer):
        with state_writer.open(ExampleModelAdapter.MODEL_SAVE_LOCATION, "wb") as model_file:
            pickle.dump(self.model, model_file)
        with state_writer.open(ExampleModelAdapter.METADATA_SAVE_LOCATION, "wb") as metadata_file:
            metadata = {
                ExampleModelAdapter.PREDICTION_COLUMN_NAME_KEY: self.prediction_column_name
            }
            pickle.dump(metadata, metadata_file)

    @classmethod
    def api(cls):
        inputs = [
            ModelInput.Tabular(name="input_df",
                               df_type=DFType.PANDAS,
                               columns=[ModelApiColumn(name="features", type=tuple)])
        ]
        outputs = [
            ModelOutput.Tabular(name="output_df",
                                columns=[ModelApiColumn(name="prediction", type=list)])
        ]
        return ModelApi(inputs, outputs)

    def run_inference(self, inputs, outputs):
        df_in = inputs.input_df
        df_out = outputs.output_df
        df = pd.DataFrame(self.model.predict(df_in))
        df_out.write(df)

    @classmethod
    def dependencies(cls):
        # DO NOT MODIFY THIS FUNCTION DEFINITION.
        # Copy this code into all model adapters published from this repo.
        # Dependencies should be added to /transforms-model-training/conda_recipe/meta.yaml
        from main._version import __version__ as generated_version_tag
        return PythonEnvironment(
            conda_dependencies=[
                CondaDependency(
                    "transforms-model-training-ri.stemma.main.repository.91e66421-692b-4338-84cb-27c6f1a1e785",
                    CondaVersionExact(version=f"{generated_version_tag}"),
                    "ri.stemma.main.repository.91e66421-692b-4338-84cb-27c6f1a1e785")
            ]
        )
Model inference transform:
from transforms.api import transform, Input, Output
from palantir_models.transforms import ModelInput

@transform(
    model=ModelInput("/Modeling Tutorials/multi label classifier/multi_label_classifier"),
    inference_input=Input("/Modeling Tutorials/PubMed Multi Label Text Classification Dataset"),
    output=Output("/Modeling Tutorials/inference_output"),
)
def compute(inference_input, model, output):
    inference_results = model.transform(input_df=inference_input)  # 1. Call ModelAdapter.transform with the inputs specified in ModelAdapter.api
    df_out = inference_results.output_df  # 2. Collect the desired output from the named tuple of inference result outputs
    output.write_pandas(df_out)
Output of model inference transform (returns only 22 rows, I am expecting 3000):
Your model adapter logic and your training logic both look great! The reason you're getting only 22 rows is this line in run_inference:
df = pd.DataFrame(self.model.predict(df_in))
Your model was trained on a single column, abstractText, but at inference time you pass in the entire pandas DataFrame, which the pipeline does not know how to handle. The first step of your pipeline, the TfidfVectorizer, iterates over its input, and iterating over a DataFrame yields its column names rather than its rows. Each column in df_in is therefore treated as one record to process. As you have 22 columns in that DataFrame, you end up with 22 output rows.
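The effect can be reproduced outside Foundry with a small sketch. The toy DataFrame below is a made-up stand-in for the PubMed data (three columns, four rows), and the pipeline mirrors the one in the question; it assumes pandas and scikit-learn are installed.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy stand-in for the real dataset: one text column and two label columns.
df = pd.DataFrame({
    "abstractText": [
        "heart disease in adults",
        "gene expression in mice",
        "heart surgery outcomes",
        "mouse gene knockout study",
    ],
    "A": [1, 0, 1, 0],
    "B": [0, 1, 0, 1],
})

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(LogisticRegression())),
])
pipe.fit(df["abstractText"], df[["A", "B"]].to_numpy())

# Iterating over a DataFrame yields its column labels, not its rows.
columns_seen = list(df)

# Passing the whole DataFrame: the vectorizer sees one "document" per column name.
wrong = pipe.predict(df)

# Passing only the text column: the vectorizer sees one document per row.
right = pipe.predict(df["abstractText"])

print(columns_seen)   # the three column names
print(wrong.shape)    # rows == number of columns in df
print(right.shape)    # rows == number of rows in df
```

With three columns and four rows, the first prediction has three rows and the second has four, which is the same mismatch as 22 versus 3000 in the real run.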
You can fix this by generating predictions on just your abstractText column instead:
df = pd.DataFrame(self.model.predict(df_in['abstractText']))
Now df has the correct number of rows and one column for each prediction class. You can optionally name those columns too:
def run_inference(self, inputs, outputs):
    df_in = inputs.input_df
    df_out = outputs.output_df
    df = pd.DataFrame(
        self.model.predict(df_in['abstractText']),
        columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M']
    )
    df_out.write(df)
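If you want this kind of mismatch to fail loudly rather than silently return too few rows, one option is to check the row count before writing. The sketch below is only an illustration: predict_frame and ConstantModel are hypothetical names that are not part of palantir_models or scikit-learn, and ConstantModel merely stands in for the fitted pipeline so the example runs on its own.

```python
import numpy as np
import pandas as pd


def predict_frame(model, df_in, text_column, label_names):
    """Predict on a single text column and return one labelled row per input row."""
    predictions = np.asarray(model.predict(df_in[text_column]))
    # Guard against the "one row per column" failure mode described above.
    if predictions.shape[0] != len(df_in):
        raise ValueError(
            f"expected {len(df_in)} prediction rows, got {predictions.shape[0]}"
        )
    # Reuse the input index so predictions can be joined back to their records.
    return pd.DataFrame(predictions, columns=label_names, index=df_in.index)


class ConstantModel:
    """Hypothetical stand-in for the fitted pipeline: predicts 0 for every label."""

    def __init__(self, n_labels):
        self.n_labels = n_labels

    def predict(self, X):
        return np.zeros((len(X), self.n_labels), dtype=int)


df_in = pd.DataFrame({
    "pmid": [11, 12, 13],
    "abstractText": ["first abstract", "second abstract", "third abstract"],
})
result = predict_frame(ConstantModel(2), df_in, "abstractText", ["A", "B"])
print(result.shape)
```

Inside the adapter, the body of run_inference could then call such a helper with self.model and write the returned DataFrame to the output.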
Hopefully that helps!