python machine-learning lightgbm pmml

Categorical feature setting error in PMML GBDTLRClassifier


I am trying to set up a GBDTLRClassifier following the instructions here. First, I label-encoded my columns. Then I defined my categorical and continuous features by putting the column names into two lists.

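import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.ensemble import GBDTLRClassifier
from sklearn2pmml.pipeline import PMMLPipeline

cat    # list of categorical column names
conts  # list of continuous column names

gbm = lgb.LGBMClassifier(n_estimators=90)
classifier = GBDTLRClassifier(gbm, LogisticRegression(penalty='l2'))
# one CategoricalDomain per categorical column, one ContinuousDomain for all continuous columns
dm = DataFrameMapper([([cat_col], CategoricalDomain()) for cat_col in cat] + [(conts, ContinuousDomain())])

pipeline = PMMLPipeline([('mapper', dm), ('classifier', classifier)])
pipeline.fit(df[cat + conts], df['y'], classifier__gbdt__eval_set=[(val[cat + conts], val['y'])], classifier__gbdt__early_stopping_rounds=5, classifier__gbdt__categorical_feature=cat)

pp = make_pmml_pipeline(pipeline, target_fields=['y'])
sklearn2pmml(pp, '/tmp/lgb+lr.pmml')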

Fitting fails with the error TypeError: Wrong type(str) or unknown name(root) in categorical_feature, even though root is definitely in cat. It looks like LightGBM is not aware of which columns are categorical, which is confusing.

Moreover, when I remove the mapper step, fitting succeeds, but the conversion to a PMML file fails with the message: transformer object of the first step does not specify the number of input features.

Could anyone tell me how to make this procedure work? Thanks.


Solution

  • Based on the comment here, I need to set feature_name whenever I pass string column names to categorical_feature. This is a little tricky.
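
A minimal sketch of what this could look like in practice. The helper name gbdt_fit_params and the example columns city, age and income are hypothetical. The classifier__gbdt__ prefix is taken from the fit call in the question, and I am assuming these keyword arguments get forwarded to LightGBM's fit the same way categorical_feature already is. The names are listed as cat + conts because that is the order in which the mapper in the question emits its columns.

```python
def gbdt_fit_params(cat, conts, prefix="classifier__gbdt__"):
    """Build fit kwargs that name every column and flag the categorical ones.

    Hypothetical helper: `prefix` follows the step names used in the question.
    """
    # Names must follow the column order produced by the mapper:
    # categorical columns first, then the continuous ones.
    feature_name = list(cat) + list(conts)
    if len(set(feature_name)) != len(feature_name):
        raise ValueError("column names must be unique")
    return {
        prefix + "feature_name": feature_name,
        prefix + "categorical_feature": list(cat),
    }


# "root" comes from the error message; the other names are made up.
params = gbdt_fit_params(["root", "city"], ["age", "income"])
for key, value in params.items():
    print(key, value)
```

The result can then be unpacked into the existing call, e.g. pipeline.fit(df[cat + conts], df['y'], **gbdt_fit_params(cat, conts)), next to the eval_set argument. I have not checked this against every sklearn2pmml or LightGBM version, so treat it as a starting point.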