How to work around the maximum number of input fields in JPMML


I have problems using PMML models with many input fields in JPMML (Scala). A minimal example follows. Load an image, resize it to 300x150 pixels, and use it as input for a PCA (Python):

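import numpy as np
import PIL.Image
from sklearn.decomposition import PCA
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

img = PIL.Image.open(filename)
img = img.resize(STANDARD_SIZE)  # 300x150
# Average each pixel's channels into one value: 45,000 features per image
img = np.array([int(np.mean(px)) for px in img.getdata()])

pca = PCA(svd_solver=pca_method, n_components=components)
train = pca.fit_transform(train_x)

pipeline = PMMLPipeline([('pca', pca), ('knn', neigh)])
sklearn2pmml(pipeline, "/tmp/pca.pmml")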

In a second step, this model should be loaded using JPMML (Scala):

val evaluator = new LoadingModelEvaluatorBuilder()
      .setLocatable(false)
      .load(new File("/tmp/pca.pmml"))
      .build()
evaluator.verify()

This fails with the following exception:

Exception in thread "main" org.jpmml.evaluator.InvalidElementException: Model has too many input fields
    at org.jpmml.evaluator.ModelEvaluatorBuilder.checkSchema(ModelEvaluatorBuilder.java:135)
    at org.jpmml.evaluator.ModelEvaluatorBuilder.build(ModelEvaluatorBuilder.java:115)
    ...

The source code of ModelEvaluatorBuilder contains the following limit:

if((inputFields.size() + groupFields.size()) > 1000){
    throw new InvalidElementException("Model has too many input fields", miningSchema);
}

So my 45k input fields (300 x 150 pixels) are far too many. If I understand the PMML documentation correctly, I can only use atomic data types (int, char, double, etc.) for the input fields.
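As a rough way to see in advance whether a PMML file would trip this check, the active fields in its top-level MiningSchema can be counted before handing it to JPMML. The sketch below uses only the Python standard library. The helper names (`local_name`, `count_active_fields`, `make_demo_pmml`) and the toy PMML skeleton are made up for illustration, and the count ignores group fields, so it only approximates what JPMML itself computes.

```python
import xml.etree.ElementTree as ET

# Limit quoted from the ModelEvaluatorBuilder check shown above.
JPMML_FIELD_LIMIT = 1000


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def count_active_fields(pmml_text):
    """Count active MiningField entries in the first MiningSchema found."""
    root = ET.fromstring(pmml_text)
    for elem in root.iter():
        if local_name(elem.tag) == "MiningSchema":
            return sum(
                1
                for field in elem
                if local_name(field.tag) == "MiningField"
                # PMML treats a missing usageType as "active".
                and field.get("usageType", "active") == "active"
            )
    return 0


def make_demo_pmml(n_inputs):
    """Build a toy, hypothetical PMML skeleton with n_inputs active fields."""
    fields = "".join(f'<MiningField name="x{i}"/>' for i in range(n_inputs))
    return (
        '<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">'
        '<RegressionModel functionName="regression"><MiningSchema>'
        '<MiningField name="y" usageType="target"/>'
        f"{fields}</MiningSchema></RegressionModel></PMML>"
    )


# One field per pixel of a 300x150 image.
n_fields = count_active_fields(make_demo_pmml(300 * 150))
print(n_fields, "active fields; limit is", JPMML_FIELD_LIMIT)
print("would be rejected:", n_fields > JPMML_FIELD_LIMIT)
```

For a real file, the same function can be called on the contents of /tmp/pca.pmml read in binary mode.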

Any ideas how I can actually work around this limit?


Solution

  • You can override the ModelEvaluatorBuilder#checkSchema(ModelEvaluator) method with your own checking logic (such as "accept everything"):

    Evaluator evaluator = new LoadingModelEvaluatorBuilder(){
        @Override
        protected void checkSchema(ModelEvaluator<?> modelEvaluator){
            // Anything goes - I'm willing to accept the responsibility for my own actions
        }
    }
        .setLocatable(false)
        .load(new File("/tmp/pca.pmml"))
        .build();
    

    This sanity check is there for a reason. (J)PMML is not meant for processing binary blobs (such as images), and it's a really bad idea to represent an image object as 45k double fields.
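One possible reading of this advice (an assumption, not something the answer spells out) is to keep the pixel-level reduction in application code and export only the downstream model, so that the PMML inputs are the handful of component scores rather than 45k pixels. The sketch below shows just the projection step in plain Python. `project_onto_components` is a hypothetical helper, and `mean` and `components` stand in for values that would come from a fitted PCA, such as scikit-learn's `mean_` and `components_` attributes with whitening disabled.

```python
def project_onto_components(pixels, mean, components):
    """Center a flattened image and project it onto each component row."""
    centered = [p - m for p, m in zip(pixels, mean)]
    return [sum(c * x for c, x in zip(row, centered)) for row in components]


# Toy numbers only: a 3-"pixel" image reduced to 2 scores.
pixels = [1.0, 2.0, 3.0]
mean = [1.0, 1.0, 1.0]
components = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
]

scores = project_onto_components(pixels, mean, components)
print(len(scores), "input fields instead of", len(pixels))
```

With this split, a model exported to PMML would declare one input field per component score, which stays well under the 1000-field limit for typical component counts.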