Variable must have at least two levels (R code)


My training data looks like this:

A   B   C   D
1   1   1   1
1   1   1   2
1   1   2   1
1   1   2   1
1   1   2   2
1   1   2   2
1   2   1   1
1   2   1   1
1   2   1   2
1   2   1   2
1   2   2   1
1   2   2   2
2   1   1   1
2   1   1   1
2   1   1   2
2   1   1   2
2   1   2   1
2   1   2   1
2   1   2   2
2   1   2   2
2   2   1   1
2   2   1   2
2   2   2   1
2   2   2   2
2   2   2   2

And my test data:

A   B   C   D
1   1   2   1
1   1   2   2
1   1   1   1
2   1   2   2

I fit the model using:

dag <- model2network("[A][B][C|A:B][D|A:B:C]")
training <- bn.fit(dag, trainingData, method = "mle", keep.fitted = TRUE)

And I am trying to predict values for column D using:

predicted <- predict(training, node = "D", data = testData, method = "parents", prob = FALSE)

But I get this error:

Error in check.data(data, allow.levels = TRUE) : variable B must have at least two levels.

How do I fix this? I assumed that the test data does not need to contain every level that appears in the training data. In fact, shouldn't it be possible to predict even if the test data has only a single row?


Solution

  • Since your variables are all encoded as factors, each one carries a set of factor levels. When you create trainingData, column B contains both 1 and 2, so its levels are (implicitly, in the background) set to c("1", "2"). But when you create testData, column B contains only 1, so its levels are (implicitly, in the background) set to just "1". That single-level factor is what triggers the error.

    We can fix this by explicitly stating that testData$B has the levels c(1, 2) even though only 1 appears in the data.

    testData$B <- factor(testData$B, levels=c(1, 2))
    

    Edit:

    Fixed the stupid mistake where I wrote training while I totally intended to write testData
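
The same idea can be applied to every column at once. The following is a minimal sketch in base R, assuming trainingData and testData are data frames of factors with matching column names, as in the question. The small data frames are hypothetical stand-ins for the real data, and the loop is just one way to do this, not something bnlearn requires.

```r
# Stand-in data (not the full data from the question):
# column B has both levels in training but only "1" in test.
trainingData <- data.frame(
  A = factor(c(1, 1, 2, 2)),
  B = factor(c(1, 2, 1, 2))
)
testData <- data.frame(
  A = factor(c(1, 1, 1, 2)),
  B = factor(c(1, 1, 1, 1))
)

# Diagnose: count how many levels each test column carries.
# Any column reporting 1 would trip the "at least two levels" check.
sapply(testData, nlevels)

# Re-declare each test column using the level set of the matching
# training column. The observed values themselves do not change.
for (col in names(testData)) {
  testData[[col]] <- factor(testData[[col]],
                            levels = levels(trainingData[[col]]))
}

# Every column now reports the same number of levels as in training.
sapply(testData, nlevels)
```

Taking the levels from trainingData rather than typing c(1, 2) by hand keeps the two data frames consistent if the training data later gains more levels.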