python scikit-learn pipeline pmml sklearn2pmml

sklearn to PMML: can't create a pipeline for the preprocessing step of categorical columns


I'm having a tough time creating a PMML pipeline with the sklearn2pmml library (Python). I want to convert categorical variables to numerical ones by reassigning their values, but I don't know how. I tried many sklearn preprocessors, but they are not compatible. Has anyone encountered the same problem? Here's an example. I know it is clearly wrong, but I want to make sure you understand what I'm trying to do. Even an automatable solution in PMML would help me. Thanks.

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml

# FORGET ABOUT TRAIN TEST SPLIT; we only care if the PMML pipeline works for now
BIRTHDAY_SEED = 1995
nrows, cols = 1000, 5
X, y = make_classification(n_samples=nrows, n_features=cols, n_informative=2, n_redundant=3, n_classes=2, shuffle=True, random_state=BIRTHDAY_SEED)
X, y = pd.DataFrame(X), pd.Series(y)
X["cat_variable"] = np.random.choice(["a","b","c"],size=len(X) )

# DEFINE FUNCTIONS FOR REASSIGNING CATEGORY
def simple_category_asignation(value):
    """
    The return values are arbitrary; I just want to reassign a number to each category.
    """
    if value == "a":
        return 1.5
    elif value == "b":
        return 2.0
    elif value == "c":
        return 1.97
    else:
        return -1
def preprocessing_cat_variables(X):
    """
    Preprocess the categorical variable.
    """
    X["cat_variable"] = X["cat_variable"].apply(simple_category_asignation)
    return X

# FIT THE MODEL AND TRY TO CREATE THE PMML PIPELINE, it does not work
X = preprocessing_cat_variables(X)
model = DecisionTreeClassifier()
model.fit(X,y)

pmml_pipeline = PMMLPipeline([
    # here we should place the category preprocessor; I know this does not work, but it should give you the idea
    ("preprocessing_categories_step", preprocessing_cat_variables),
    ('decisiontree', model)
])
sklearn2pmml(pmml_pipeline, "example_pipeline_pmml.pmml", with_repr = True)

Solution

  • You can map from one categorical value space (strings) to another (floats) using the sklearn2pmml.preprocessing.LookupTransformer transformer type.

    Your code simplifies to this:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn2pmml.pipeline import PMMLPipeline
    from sklearn2pmml.preprocessing import LookupTransformer
    
    mapping = {
      "a" : 1.5,
      "b" : 2.0,
      "c" : 1.97
    }
    
    pmml_pipeline = PMMLPipeline([
      ("category_remapper", LookupTransformer(mapping, default_value = -1.0)),
      ("classifier", DecisionTreeClassifier())
    ])
    

    Alternatively, you can build a mapper based on free-form Python expressions using the sklearn2pmml.preprocessing.ExpressionTransformer transformer type.
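
To make the two options above concrete, here is a minimal plain-Python sketch of the rule they both encode: a dictionary lookup with a fallback value, and the same rule written as one conditional expression. It does not use sklearn2pmml at all. The helper names `remap_by_lookup` and `remap_by_expression` are hypothetical and exist only for illustration. How such an expression would actually be passed to ExpressionTransformer is not shown here.

```python
# Plain-Python illustration of the remapping rule; no sklearn2pmml involved.
mapping = {"a": 1.5, "b": 2.0, "c": 1.97}
DEFAULT_VALUE = -1.0  # plays the role of default_value for unseen categories


def remap_by_lookup(value):
    """Hypothetical helper: dictionary lookup with a fallback value."""
    return mapping.get(value, DEFAULT_VALUE)


def remap_by_expression(value):
    """Hypothetical helper: the same rule as a single conditional expression."""
    return 1.5 if value == "a" else 2.0 if value == "b" else 1.97 if value == "c" else -1.0


samples = ["a", "b", "c", "z"]  # "z" is a category the mapping has never seen
remapped = [remap_by_lookup(v) for v in samples]
print(remapped)
```

Both helpers agree with the question's `simple_category_asignation` function for every input, which is why either a lookup table or an expression can stand in for it.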