I am looking to discretise continuous features in machine-learning datasets, in particular using supervised discretisation. It turns out that R has a package/method for this, great! But since I am not proficient in R, I have run into some issues and would greatly appreciate your help.
I get the error "class variable needs to be a factor".
I looked at an example online, and it does not seem to have this problem, but I do. Note that I do not quite understand the formula syntax V2 ~ . beyond the fact that V2 should be a column name.
library(caret)
library(Rcpp)
library(arulesCBA)
filename <- "wine.data"
dataset <- read.csv(filename, header=FALSE)
dataset2 <- discretizeDF.supervised(V2 ~ ., dataset, method = "mdlp")
R reports the following error:
Error in .parseformula(formula, data) : class variable needs to be a factor!
You can find the dataset wine.data here: https://pastebin.com/hvDbEtMN
The first parameter of discretizeDF.supervised is a formula, and that seems to be where the problem lies.
Please help! Thank you in advance.
As written in the vignette, this is meant to implement:
several supervised methods to convert continuous variables into a categorical variables (factor) suitable for association rule mining and building associative classifiers.
If you look at your V2 column, it's continuous:
test = read.csv("wine_dataset.txt",header=FALSE)
str(test)
'data.frame': 178 obs. of 14 variables:
$ V1 : int 1 1 1 1 1 1 1 1 1 1 ...
$ V2 : num 14.2 13.2 13.2 14.4 13.2 ...
$ V3 : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
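On the formula syntax: in V2 ~ . the left-hand side names the class (target) column, and the dot stands for every other column in the data frame. A minimal base-R sketch of how the formula expands and why the factor check fails; the data frame `toy` and its values are made up for illustration and are not the real wine data:

```r
# Toy data frame mimicking the first columns of the wine data (values made up).
toy <- data.frame(V1 = c(1L, 2L, 3L),
                  V2 = c(14.2, 13.2, 12.9),
                  V3 = c(1.71, 1.78, 2.36))

# Expand the dot: V2 is the response, all remaining columns are predictors.
tt <- terms(V2 ~ ., data = toy)
all.vars(tt)              # every variable in the expanded formula
attr(tt, "term.labels")   # predictors only

# The left-hand side must be a factor for supervised discretisation.
is.factor(toy$V2)         # FALSE for a numeric column
```

So the error is not about the formula itself but about the type of the column named on its left-hand side.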
What you need is a categorical target, stored as a factor, so that the algorithm can use its classes to choose cut points for the continuous columns. For example:
# cut V2 into 4 equal-width categories spanning its range
test$V2 = factor(cut(test$V2, 4, labels = 1:4))
# pass the modified data frame (test), not the original dataset
dataset2 <- discretizeDF.supervised(V2 ~ ., test, method = "mdlp")
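For reference, cut(x, 4) splits the range of x into four equal-width intervals, so the resulting classes can be very unbalanced when the values are skewed. A sketch of one possible alternative, cutting at the quartiles instead; the vector `x` is made up for illustration:

```r
# Made-up skewed values, for illustration only.
x <- c(11.0, 11.2, 11.4, 11.5, 11.6, 11.8, 12.0, 14.8)

# Equal-width bins: the range of x is split into 4 intervals of equal length.
equal_width <- cut(x, 4, labels = 1:4)
table(equal_width)

# Quartile bins: breaks at the 0/25/50/75/100% quantiles, so each class
# holds roughly the same number of observations.
breaks <- quantile(x, probs = seq(0, 1, 0.25))
equal_freq <- cut(x, breaks, labels = 1:4, include.lowest = TRUE)
table(equal_freq)
```

Which of the two is appropriate depends on the data; neither is what the answer above prescribes, only a way to compare the options.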
Cutting V2 like this is one workaround, but you need to choose the cut points sensibly. If you need to keep the target continuous, you can instead use discretizeDF from arules, which is unsupervised. I also see that your first column contains only the values 1, 2 and 3, so it is left as-is together with V2 below:
test = read.csv("wine_dataset.txt",header=FALSE)
test2 = data.frame(test[,1:2],discretizeDF(test[,-c(1:2)]))
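Since the first column takes only the values 1, 2 and 3, it looks like the class label of the wine data, in which case it could serve as the target once converted to a factor. A sketch under that assumption: a small made-up data frame stands in for the file, and the arulesCBA call is skipped if the package is not installed.

```r
# Made-up stand-in for the wine data: V1 is assumed to be the class label.
set.seed(1)
toy <- data.frame(
  V1 = rep(1:3, each = 20),
  V2 = c(rnorm(20, 13.7, 0.3), rnorm(20, 12.3, 0.3), rnorm(20, 13.1, 0.3)),
  V3 = c(rnorm(20, 2.0, 0.5), rnorm(20, 1.9, 0.5), rnorm(20, 3.3, 0.5))
)

# Convert the integer labels to a factor so the formula's left-hand side is valid.
toy$V1 <- factor(toy$V1)

if (requireNamespace("arulesCBA", quietly = TRUE)) {
  # Supervised (MDLP) discretisation of the numeric columns, guided by V1.
  toy_disc <- arulesCBA::discretizeDF.supervised(V1 ~ ., toy, method = "mdlp")
  str(toy_disc)
}
```

With the real file, the same two steps would be test$V1 = factor(test$V1) followed by the discretizeDF.supervised call with V1 ~ . as the formula.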