Perhaps this is a dumb question, but I am a new convert from SAS and am still finding my way around. What is the easiest way to clean a data set before running models? For example, I have a dataset with 100 variables. How can I remove character/factor variables with fewer than 2 levels before running a model? This seems to happen on the fly in SAS, and I find it a pain to manually drop variables in R before modelling. Surely there must be a better way. Thanks in advance.
You could try (a modification of @Richard Scriven's answer):
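# TRUE for factor columns with fewer than 2 levels
indx <- sapply(dat, function(x) is.factor(x) && nlevels(x) < 2)
# drop = FALSE keeps a data frame even if only one column remains
dat1 <- dat[, !indx, drop = FALSE]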
head(dat1)
# Col1 Col3
#1 1.3709584 B
#2 -0.5646982 B
#3 0.3631284 B
#4 0.6328626 D
#5 0.4042683 A
#6 -0.1061245 D
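One caveat with counting levels: a factor can carry levels that no longer occur in the data, for instance after subsetting. The following is a minimal sketch of how that might be handled with `droplevels()`; the data frame `toy` and the names `toy_sub`, `indx_used`, and `toy_clean` are hypothetical and not part of the original example.

```r
# Hypothetical toy data: after subsetting, 'grp' still declares both
# levels although only "a" occurs in the remaining rows.
toy <- data.frame(x = 1:4, grp = factor(c("a", "a", "b", "b")))
toy_sub <- toy[toy$grp == "a", ]

nlevels(toy_sub$grp)              # declared levels, including unused ones
nlevels(droplevels(toy_sub$grp))  # levels actually present in the data

# Flag factor columns with fewer than 2 levels actually present
indx_used <- sapply(toy_sub, function(x) is.factor(x) && nlevels(droplevels(x)) < 2)
toy_clean <- toy_sub[, !indx_used, drop = FALSE]
names(toy_clean)
```

Whether unused levels should count depends on the model you plan to fit, so treat this as one possible convention rather than a rule.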
Suppose you have both character and factor columns and want to remove those with fewer than 2 unique levels/values. For example, make Col4 a character column:
dat$Col4 <- as.character(dat$Col4)
Recomputing indx with the code above then gives the wrong result, because Col4 is no longer a factor and is therefore kept:
head(dat[,!indx],2)
# Col1 Col3 Col4
#1 1.3709584 B Yes
#2 -0.5646982 B Yes
Here, you could do:
indx1 <- sapply(dat, function(x) !is.numeric(x) & length(unique(x))<2)
head(dat[,!indx1])
# Col1 Col3
#1 1.3709584 B
#2 -0.5646982 B
#3 0.3631284 B
#4 0.6328626 D
#5 0.4042683 A
#6 -0.1061245 D
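If this cleaning step has to be repeated across many data sets, it could be wrapped in a small function. The sketch below is only one way to do it: the function name `drop_low_level_cols`, its `min_levels` argument, and the `demo` data frame are hypothetical, and ignoring `NA` when counting distinct values is an assumption that may not suit every analysis.

```r
# Hypothetical helper: drop non-numeric columns that have fewer than
# `min_levels` distinct non-missing values.
drop_low_level_cols <- function(df, min_levels = 2) {
  low <- vapply(df, function(x) {
    !is.numeric(x) && length(unique(x[!is.na(x)])) < min_levels
  }, logical(1))
  # drop = FALSE keeps a data frame even if only one column remains
  df[, !low, drop = FALSE]
}

# Assumed example data with one numeric, two factor and one character column
set.seed(42)
demo <- data.frame(Col1 = rnorm(25),
                   Col2 = "A",
                   Col3 = rep(LETTERS[1:5], 5),
                   Col4 = "Yes",
                   stringsAsFactors = TRUE)
demo$Col4 <- as.character(demo$Col4)

names(drop_low_level_cols(demo))
# "Col1" "Col3"
```

Using `vapply()` with `logical(1)` instead of `sapply()` guarantees a logical vector even for a data frame with zero columns.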
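Data used in the examples above:
set.seed(42)
# stringsAsFactors=TRUE is needed in R >= 4.0.0, where strings are no longer converted to factors by default
dat <- data.frame(Col1=rnorm(25), Col2=LETTERS[1],
 Col3=sample(LETTERS[1:5], 25, replace=TRUE), Col4="Yes",
 stringsAsFactors=TRUE)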