pmml in R generating improper variable names


I am using the pmml package in R to generate PMML for a logistic regression model fitted with the glm function, as follows:

library(pmml)
var <- sample(c(1,2,3),100,replace = TRUE)
var_cat <- sample(c(1,2,3,4),100,replace = TRUE)
y <- sample(c(0,1),100,replace = TRUE)
df <- data.frame(y = as.factor(y),var = as.factor(var), var_cat = as.factor(var_cat))
model <- glm(y ~ ., data = df, family = binomial)
pmmlOutput <- pmml(model)

The PPMatrix portion of this PMML is shown below:

<PPMatrix>
   <PPCell value="2" predictorName="var" parameterName="p1"/>
   <PPCell value="3" predictorName="var" parameterName="p2"/>
   <PPCell value="_cat2" predictorName="var" parameterName="p3"/>
   <PPCell value="2" predictorName="var_cat" parameterName="p3"/>
   <PPCell value="_cat3" predictorName="var" parameterName="p4"/>
   <PPCell value="3" predictorName="var_cat" parameterName="p4"/>
   <PPCell value="_cat4" predictorName="var" parameterName="p5"/>
   <PPCell value="4" predictorName="var_cat" parameterName="p5"/>
</PPMatrix>

The first variable and its levels appear correctly as (var,2) and (var,3). However, each level of the second variable produces two lines, and in one of them the variable name and the level are split at the wrong position.

Instead of getting (var_cat,2), it is getting split into (var,_cat2) as highlighted below:

<PPCell value="_cat2" predictorName="var" parameterName="p3"/>

This seems to happen only when one variable name overlaps with another (in this case var and var_cat). If var_cat is the only variable present, the output is correct.

Could someone suggest a way to address this issue?


Solution

  • Unfortunately, you are correct; you have found a bug in the R code.

    The way it finds the values effectively assumes that no variable name is a substring of another.

    Since var is a substring of var_cat, you get this error. Notice that var_cat and cat would also potentially give you the same problem. On the other hand, var_cat1 is not a substring of var_cat2, so that should work.

    For now, the easiest workaround is to name the variables so that no variable name is a substring of another. Fortunately, we are planning a new release in the next couple of weeks, and I will try to include a fix for this in that release.
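As a rough sketch of the renaming workaround, the base R snippet below checks the predictor names for substring overlaps and renames the offending column before the model is fitted. The helper find_substring_names and the replacement name var_main are made up for illustration; they are not part of the pmml package, and the check only mirrors the substring condition described above, not the package's internal code.

```r
# Hypothetical helper (not part of the pmml package): return every pair of
# names where one name occurs inside the other.
find_substring_names <- function(nms) {
  pairs <- list()
  for (short in nms) {
    for (long in nms) {
      # fixed = TRUE treats the name as a literal string, not a regex
      if (short != long && grepl(short, long, fixed = TRUE)) {
        pairs[[length(pairs) + 1]] <- c(short = short, long = long)
      }
    }
  }
  pairs
}

# Same data layout as in the question
set.seed(1)
df <- data.frame(
  y       = as.factor(sample(c(0, 1), 100, replace = TRUE)),
  var     = as.factor(sample(c(1, 2, 3), 100, replace = TRUE)),
  var_cat = as.factor(sample(c(1, 2, 3, 4), 100, replace = TRUE))
)

# Check the predictors only; "var" is reported as a substring of "var_cat"
collisions <- find_substring_names(setdiff(names(df), "y"))

# Rename the shorter column (var_main is an arbitrary, illustrative choice)
names(df)[names(df) == "var"] <- "var_main"
remaining <- find_substring_names(setdiff(names(df), "y"))

# Fit as before; pmml(model) can then be called on this model
model <- glm(y ~ ., data = df, family = binomial)
coef_names <- names(coef(model))
```

After the rename, neither var_main nor var_cat contains the other, so the condition that triggers the bad split should no longer apply when pmml(model) is called.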