Pre-processing in data mining sometimes involves re-grouping and re-coding categorical variables. It is well known that once you recode categorical variables in R (e.g. with the function mapvalues) you need to update each categorical variable with df$variable <- factor(df$variable) so that str(df) shows the real number of levels in your data.frame.
I have written a piece of code to automatically update the categorical variables of a dataset:
cat <- sapply(df, is.factor) #Select categorical variables
names(df[ ,cat]) #View which ones they are
A <- function(x) factor(x) #Create function for "apply"
df[ ,cat] <- data.frame(apply(df[ ,cat],2, A)) #Run apply function
str(df) #Check
My question is: how could I select columns whose number of levels is equal to 1, once I have updated my dataset? I have tried these lines without luck:
cat <- sapply(df, is.factor) #Select categorical variables
categorical <- df[,cat] #Create a df named "categorical" separating them
A <- function(x) nlevels(x)==1 #Create "A" function for apply
x <- data.frame(apply(categorical,2, A)) #Run apply function
utils::View(x) #Check and see it is not working...
I appreciate your help and time
You can create a logical index with sapply and use that to select the columns.
indx <- sapply(df[,cat], nlevels)==1
df[,cat][,indx, drop=FALSE]
Or another option is Filter
Filter(function(x) nlevels(x)==1, df[,cat])
Or, in versions of R where var still accepts a factor (recent versions stop with an error instead):
Filter(Negate(var), df[,cat])
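As a self-contained sketch of the selection above, the following uses a small hypothetical data.frame (toy, with made-up columns f1, f2, f3 and n, none of which come from the question) to show the logical index and the Filter variant side by side. It only assumes base R.

```r
# Hypothetical toy data: f1 and f3 have a single level, f2 has two, n is numeric
toy <- data.frame(
  f1 = factor(c("a", "a", "a")),
  f2 = factor(c("x", "y", "x")),
  f3 = factor(c("k", "k", "k")),
  n  = 1:3
)

# Logical vector marking the factor columns
is_fac <- sapply(toy, is.factor)

# Logical vector marking which of those factors have exactly one level
one_level <- sapply(toy[is_fac], nlevels) == 1

# Names of the single-level factor columns
single_names <- names(one_level)[one_level]

# Subset by the index; drop = FALSE keeps a data.frame even for one column
single_cols <- toy[is_fac][, one_level, drop = FALSE]

# Same selection with Filter
single_cols2 <- Filter(function(x) nlevels(x) == 1, toy[is_fac])

print(single_names)
```

Using drop = FALSE matters when only one column matches, because otherwise the result collapses to a plain factor.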
Regarding why apply didn't work:
apply(df[cat], 2, nlevels)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 0 0 0 0 0 0 0 0 0 0
The output is 0 for all the columns, so something is not correct. Checking further:
apply(df[cat], 2, class)
# V1 V2 V3 V4 V5 V6
#"character" "character" "character" "character" "character" "character"
# V7 V8 V9 V10
#"character" "character" "character" "character"
The correct class can be found with
sapply(df[cat], class)
# V1 V2 V3 V4 V5 V6 V7 V8
#"factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor"
# V9 V10
#"factor" "factor"
The class of the columns changed from 'factor' to 'character' because apply converts a data.frame to a matrix (via as.matrix) before calling the function, and a matrix can hold only a single class. If there is any non-numeric column, the whole matrix becomes 'character'. Using apply is fine for an all-numeric data.frame or matrix, since the values stay 'numeric'. In general, when a data.frame has columns of mixed classes, it is better to use lapply or vapply; sapply is also useful when you want a simple result such as a logical vector.
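The difference can be reproduced on a small hypothetical data.frame (toy, with made-up factor columns g and h). This is only a minimal sketch of the coercion described above, using base R.

```r
# Hypothetical toy data with two factor columns
toy <- data.frame(
  g = factor(c("a", "a", "b")),
  h = factor(c("u", "u", "u"))
)

# apply() first converts the data.frame with as.matrix(), so each column
# reaches the function as a character vector rather than a factor
via_apply_class  <- apply(toy, 2, class)
via_apply_levels <- apply(toy, 2, nlevels)

# sapply()/vapply() iterate over the columns as stored, so factors stay factors
via_sapply_class  <- sapply(toy, class)
via_vapply_levels <- vapply(toy, nlevels, integer(1))

print(via_apply_class)
print(via_apply_levels)
print(via_sapply_class)
print(via_vapply_levels)
```

vapply is used for the level counts because it checks that every column returns a single integer, which makes a silent type change easier to spot.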
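The data used in the examples:
set.seed(64)
df <- as.data.frame(matrix(sample(LETTERS[1:3], 3*10, replace=TRUE), ncol=10), stringsAsFactors=TRUE) #stringsAsFactors=TRUE is needed in R >= 4.0.0 to get factor columns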
df <- cbind(df, V11=1:3)
cat <- sapply(df, is.factor)