python java scikit-learn pmml sklearn2pmml

PMML model makes different predictions to original model


I have built an MLPClassifier in scikit-learn for an NLP multi-label classification problem, using CountVectorizer for feature extraction. The aim is to move it into a Java project via PMML, specifically with sklearn2pmml:

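from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

clf = PMMLPipeline([
    ('tf', CountVectorizer(token_pattern=r'\S+', max_features=400)),
    ('classifier', MLPClassifier(max_iter=300, random_state=1))
])

clf.fit(X, Y)

sklearn2pmml(clf, 'test.pmml', with_repr=True)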
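
As a rough illustration of what token_pattern='\S+' means for the input text, the sketch below approximates CountVectorizer's default analysis (lowercase the text, then collect every regex match) using only the standard library. The helper name approx_count_tokens and the sample string are hypothetical, and max_features is ignored here. This is not the vectorizer itself, only a way to see which tokens a given input string would produce.

```python
import re
from collections import Counter


def approx_count_tokens(text, token_pattern=r"\S+"):
    """Approximate CountVectorizer's default analysis: lowercase, then regex findall."""
    return Counter(re.findall(token_pattern, text.lower()))


# Hypothetical pre-tokenized input: tokens separated by whitespace.
sample = "< script > alert ( 1 ) < / script >"
counts = approx_count_tokens(sample)
print(counts)

# Without whitespace, the whole string is a single token.
print(approx_count_tokens("<script>alert(1)</script>"))
```

With this pattern, tokens are defined purely by whitespace, so any evaluator needs to receive the text in the same whitespace-delimited form that was used for training.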

I am running into a problem after importing the PMML model, whether I import it into Java or back into Python: the imported model makes completely different predictions and always assigns the same label regardless of input, as shown here.

This varies greatly from the original model, so I assume I must have gone wrong somewhere.

Trying to fix this, I found this Stack Overflow post relating to a similar issue. One of the suggestions was to use DataFrames for training the model to avoid ambiguity. I currently use Series so I tried this.

I went from this:

X = data['tokenized']
Y = data['Type']

To this:

X = pd.DataFrame(columns = ['tokenized'], data = data.get('tokenized'))
Y = pd.DataFrame(columns = ['Type'], data = data.get('Type'))

However, trying to now train the model I get the following error:

ValueError: Found input variables with inconsistent numbers of samples: [1, 8492]

Is there a way to use DataFrames without causing an error like this? I've seen other posts suggesting it's a difference in size between X and Y, but they both return the same value for .shape.

I'd like to know where I've gone wrong in training my original model, or whether it's to do with the data format I am passing to the exported model. I appreciate any help!

EDIT:

Below is the PMML4S Java implementation which produced the incorrect results:

Model model = Model.fromFile(MLModel.class.getClassLoader().getResource("model-3.pmml").getFile());

Series result = model.predict(
        Series.fromArray(new Object[] { tokenize("<script>alert(1)</script>") }, model.inputSchema()));
for (int j = 0; j < model.outputFields().length; j++) {
    System.out.println(model.outputFields()[j].name() + " -- " + result.toArray()[j]);
}

With this output:

probability(CMD) -- 2.154421715947742E-30
probability(Dir Traversal) -- 1.0667576496332316E-64
probability(SQLi) -- 5.253589894541598E-30
probability(Template) -- 0.9999999997883969
probability(XSS) -- 2.1160304728185066E-10
probability(XXE) -- 2.4613046473076675E-29

Using the JPMML library as suggested below, I get the correct results:

Evaluator evaluator = new LoadingModelEvaluatorBuilder()
        .load(new File(MLModel.class.getClassLoader().getResource("model-3.pmml").getFile()))
        .build();

evaluator.verify();

Map<String, String> arguments = new HashMap<>();
arguments.put("tokenized", tokenize("<script>alert(1)</script>"));

Map<String, ?> results = evaluator.evaluate(arguments);

System.out.println(results.get("Type"));

With this output, correctly predicting the class as XSS:

{result=XSS, probability_entries=[CMD=0.02470137516811933, Dir Traversal=1.2112725851331142E-5, SQLi=5.288121150550626E-5, Template=9.421616924467085E-6, XSS=0.9716583665392814, XXE=0.003565842738318025], entityId=2/5}
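
The two outputs above can also be compared mechanically. The following is a minimal sketch, using only the probabilities printed above; the names pmml4s_probs, jpmml_probs and top_label are hypothetical and chosen for illustration. It picks the most probable label from each evaluator's output and reports the largest per-label difference.

```python
# Probabilities copied from the PMML4S output above.
pmml4s_probs = {
    "CMD": 2.154421715947742e-30,
    "Dir Traversal": 1.0667576496332316e-64,
    "SQLi": 5.253589894541598e-30,
    "Template": 0.9999999997883969,
    "XSS": 2.1160304728185066e-10,
    "XXE": 2.4613046473076675e-29,
}

# Probabilities copied from the JPMML output above.
jpmml_probs = {
    "CMD": 0.02470137516811933,
    "Dir Traversal": 1.2112725851331142e-5,
    "SQLi": 5.288121150550626e-5,
    "Template": 9.421616924467085e-6,
    "XSS": 0.9716583665392814,
    "XXE": 0.003565842738318025,
}


def top_label(probs):
    """Return the label with the highest probability."""
    return max(probs, key=probs.get)


# Largest absolute difference between the two evaluators for any single label.
max_diff = max(abs(pmml4s_probs[k] - jpmml_probs[k]) for k in pmml4s_probs)

print(top_label(pmml4s_probs))  # Template
print(top_label(jpmml_probs))   # XSS
print(max_diff)
```

Running the same comparison against the original scikit-learn model's predict_proba output for a handful of inputs would be one way to check an exported model before deploying it.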


Solution

  • Please give more information about your PMML deployment side ("whether I import to Java or back into Python"). Are you using PyPMML perhaps?

    What results do you get if you switch to JPMML-Evaluator (for Java) or JPMML-Evaluator-Python (for Python) instead? The predictions should then come out correct.

    "Trying to fix this, I found this Stack Overflow post relating to a similar issue."

    The referenced SO post is about PyPMML's broken column mapping mechanism.
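
On the ValueError from the DataFrame attempt in the question, which the answer above does not address: a likely explanation, assuming standard pandas and scikit-learn behaviour, is that text vectorizers such as CountVectorizer obtain documents by iterating over their input, and iterating over a DataFrame yields its column labels rather than its rows. The vectorizer would then see a single "document" (the string 'tokenized') while Y still has 8492 rows, which matches the [1, 8492] in the message. The sketch below shows the iteration difference with pandas alone; the three-row frame is a hypothetical stand-in for the asker's data.

```python
import pandas as pd

# Hypothetical stand-in for the asker's `data` frame.
data = pd.DataFrame({
    "tokenized": ["< script > alert ( 1 )", "' or 1 = 1 --", "../ ../ etc/passwd"],
    "Type": ["XSS", "SQLi", "Dir Traversal"],
})

# A single-column DataFrame, comparable to the X built in the question.
X_frame = data[["tokenized"]]

# Iterating a DataFrame yields column labels, not rows.
print(X_frame.shape)   # (3, 1)
print(list(X_frame))   # ['tokenized']

# Selecting the column gives a Series, which iterates over the documents.
X_series = X_frame["tokenized"]
print(len(list(X_series)))  # 3
```

Under that assumption, passing the column as a Series (as in the original code) is what gives the vectorizer one document per row.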