I try to set up my GBDTLRClassifier following the instruction here. First, I have done label encode on my columns. Then I define my categorical and continuous features, putting column names in two list.
cat # categorical column names
conts # continuous column names
gbm = lgb.LGBMClassifier(n_estimator = 90)
classifier = GBDTLRClassifier(gbm, LogisticRegression(penalty='l2'))
dm = DataFrameMapper([([cat_col], CategoricalDomain()) for cat_col in cat] + [(conts, ContinuousDomain())])
pipeline = PMMLPipeline([('mapper', dm), ('classifier', classifier)])
pipeline.fit(df[cat + conts], df['y'], classifier__gbdt__eval_set=[(val[cat + conts], val['y'])], classifier__gbdt__early_stopping_rounds = 5, classifier__gbdt__categorical_feature=cat)
pp = make_pmml_pipeline(pipelin, target_fields=['y'])
sklearn2pmml(pp, '/tmp/lgb+lr.pmml')
I get error message in fitting:TypeError: Wrong type(str) or unknown name(root) in categorical_feature
. While root
is definitely in cat
. Looks like lgbm not aware of which columns are categorical, which is confusing.
Moreover, when I remove the mapper part, no fitting error but convert failed in making pmml file with message: transformer object of the first step does not specify the number of input features
.
Does anyone could tell how to make this procedure work. THx
Based on comment here, need to set feature_name
when I send string column names into categorical_feature
. A little tricky here.