Tags: r, syntax, discretization

"Class variable needs to be a factor" error for csv-read datasets


I am looking to discretise continuous features in datasets, in particular using supervised discretisation. It turns out that R has a package/method for this, which is great! Since I am not proficient in R, I have run into some issues and would greatly appreciate your help.

I get an error

class variable needs to be a factor.

I looked at an example online, and it does not seem to have this problem, but my code does. Note that I do not quite understand the formula V2 ~ ., other than that V2 should be a column name.

library(caret)
library(Rcpp)
library(arulesCBA)

filename <- "wine.data"
dataset <- read.csv(filename, header=FALSE)
dataset2 <- discretizeDF.supervised(V2 ~ ., dataset, method = "mdlp")

R reports the following error:

Error in .parseformula(formula, data) : class variable needs to be a factor!

You may find the dataset wine.data here: https://pastebin.com/hvDbEtMN

The first parameter of discretizeDF.supervised is a formula, and that seems to be the problem.

Please help! Thank you in advance.


Solution

  • As written in the vignette, discretizeDF.supervised is meant to implement:

    several supervised methods to convert continuous variables into a categorical variables (factor) suitable for association rule mining and building associative classifiers.

    If you look at your V2 column, it's continuous (numeric), not a factor:

    test = read.csv("wine_dataset.txt",header=FALSE)
    str(test)
    'data.frame':   178 obs. of  14 variables:
     $ V1 : int  1 1 1 1 1 1 1 1 1 1 ...
     $ V2 : num  14.2 13.2 13.2 14.4 13.2 ...
     $ V3 : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
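To confirm this directly rather than reading the str() output, you can test each column with is.factor(). A minimal sketch on a small made-up data frame (toy and its values are hypothetical stand-ins, not the real wine data):

```r
# Hypothetical stand-in for a few rows of the wine data
toy <- data.frame(
  V1 = c(1L, 1L, 2L, 2L, 3L, 3L),
  V2 = c(14.2, 13.2, 12.4, 12.3, 12.9, 13.7),
  V3 = c(1.71, 1.78, 0.94, 1.10, 1.35, 5.65)
)

# Which columns are factors? None, because numeric input stays numeric.
col_is_factor <- sapply(toy, is.factor)
print(col_is_factor)

# Class of each column: "integer" for V1, "numeric" for V2 and V3
col_class <- sapply(toy, class)
print(col_class)
```

Any column used on the left-hand side of the formula has to show up as a factor in this check.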
    

    What you need is a categorical target (a factor), so that the algorithm can choose cut points for the other columns that best separate its classes. For example:

    # cut V2 into 4 equal-width bins over its range, labelled 1 to 4
    # (cut() already returns a factor)
    test$V2 = cut(test$V2, 4, labels = 1:4)
    # pass the modified data frame (test), not the original dataset
    dataset2 <- discretizeDF.supervised(V2 ~ ., test, method = "mdlp")
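cut(x, 4) splits the range of the values into four equal-width intervals, which can leave some bins nearly empty when the values are skewed. One alternative, sketched here on simulated values (x is hypothetical data, not the wine column), is to cut at the quartiles so that each bin holds about the same number of rows:

```r
set.seed(42)
# Hypothetical skewed values: most around 12.5, a smaller group around 14.2
x <- c(rnorm(80, mean = 12.5, sd = 0.4), rnorm(20, mean = 14.2, sd = 0.3))

# Equal-width bins, as in the cut() call above
equal_width <- cut(x, 4, labels = 1:4)

# Equal-frequency bins: break at the quartiles of x.
# include.lowest = TRUE keeps the minimum value inside the first bin.
equal_freq <- cut(x,
                  breaks = quantile(x, probs = seq(0, 1, by = 0.25)),
                  labels = 1:4,
                  include.lowest = TRUE)

print(table(equal_width))  # bin sizes depend on where the values fall
print(table(equal_freq))   # 25 values per bin for these 100 distinct values
```

Either result is a factor and can be used as the class variable. Which binning is appropriate depends on what the categories are supposed to mean.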
    

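    Binning V2 like this is one workaround, but you need to choose the cut points for V2 sensibly. If you need to keep the target continuous, you can instead use discretizeDF from arules, which discretizes without a class variable. I also see that your first column contains only 1, 2 and 3, so the code below leaves the first two columns as they are and discretizes the rest: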

    test = read.csv("wine_dataset.txt",header=FALSE)
    test2 = data.frame(test[,1:2],discretizeDF(test[,-c(1:2)]))
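Since the first column only takes the values 1, 2 and 3, another option is to treat it as the class and let the supervised method discretize everything else. This is an assumption about what V1 represents, so check it against the dataset description first. The sketch below uses simulated data (toy is hypothetical, not the real wine file) and also shows what the formula means: the name left of ~ is the class variable, and the dot stands for all remaining columns.

```r
library(arulesCBA)

set.seed(1)
# Hypothetical data shaped like the wine file: V1 is a class label 1/2/3,
# V2 and V3 are continuous measurements.
toy <- data.frame(
  V1 = rep(1:3, each = 30),
  V2 = c(rnorm(30, 13.7, 0.3), rnorm(30, 12.3, 0.3), rnorm(30, 13.1, 0.3)),
  V3 = c(rnorm(30, 2.0, 0.5), rnorm(30, 1.9, 0.5), rnorm(30, 3.3, 0.5))
)

# The class variable must be a factor, otherwise the
# "class variable needs to be a factor" error is raised.
toy$V1 <- factor(toy$V1)

# What "V1 ~ ." expands to: every column except V1
predictors <- attr(terms(V1 ~ ., data = toy), "term.labels")
print(predictors)  # "V2" "V3"

# Supervised discretization of the predictors with respect to V1
toy_disc <- discretizeDF.supervised(V1 ~ ., toy, method = "mdlp")
str(toy_disc)
```

For the real file, the same two steps would be dataset$V1 <- factor(dataset$V1) followed by discretizeDF.supervised(V1 ~ ., dataset, method = "mdlp").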