
Select categorical variables where number of levels is equal to 1


Pre-processing in data mining sometimes involves regrouping and recoding categorical variables. It is well known that once you recode a categorical variable in R (e.g. with the function mapvalues), you need to refresh it with df$variable <- factor(df$variable) so that str(df) shows the real number of levels in your data.frame.

I have written a piece of code to automatically update the categorical variables of a dataset:

cat <- sapply(df, is.factor) # Select categorical variables
names(df[ ,cat]) # View which ones they are
A <- function(x) factor(x) # Create function for "apply"
df[ ,cat] <- data.frame(apply(df[ ,cat], 2, A)) # Run apply function
str(df) # Check

My question is: how could I select columns whose number of levels is equal to 1, once I have updated my dataset? I have tried these lines without luck:

cat <- sapply(df, is.factor) #Select categorical variables
categorical <- df[,cat] #Create a df named "categorical" separating them
A <- function(x) nlevels(x)==1 #Create "A" function for apply
x <- data.frame(apply(categorical,2, A)) #Run apply function
utils::View(x) #Check and see it is not working...

I appreciate your help and time


Solution

  • You can create a logical index with sapply and use it to select the columns:

      indx <- sapply(df[,cat], nlevels)==1
      df[,cat][,indx, drop=FALSE]
    

    Or another option is Filter

     Filter(function(x) nlevels(x)==1, df[,cat])
    

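    Or, in older versions of R where var() still accepted factors (current versions raise an error instead)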

     Filter(Negate(var), df[,cat])
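
As a minimal sketch of the index and Filter approaches on data whose answer is known in advance, assume a small hypothetical data frame `toy` (not from the question) with exactly one single-level factor. `vapply` is used instead of `sapply` only because it checks the return type; that is a stylistic assumption, not a requirement.

```r
# Hypothetical toy data: 'const' has one level, 'mixed' has two.
toy <- data.frame(
  const = factor(c("a", "a", "a")),
  mixed = factor(c("a", "b", "a")),
  num   = 1:3
)

is_fac    <- vapply(toy, is.factor, logical(1))             # factor columns
one_level <- vapply(toy[is_fac], nlevels, integer(1)) == 1  # single-level factors

single <- toy[is_fac][one_level]   # list-style indexing keeps a data.frame
names(single)                      # "const"

# The Filter version selects the same column.
via_filter <- Filter(function(x) nlevels(x) == 1, toy[is_fac])
names(via_filter)
```

Indexing with `toy[is_fac]` rather than `toy[, is_fac]` avoids the need for `drop=FALSE` when only one column is selected.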
    

    Regarding why the apply didn't work,

     apply(df[cat], 2, nlevels)
     # V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 
     # 0   0   0   0   0   0   0   0   0   0 
    

    the output is 0 for all the columns, so something is not correct. Upon further checking

     apply(df[cat], 2, class)
     #       V1          V2          V3          V4          V5          V6 
     #"character" "character" "character" "character" "character" "character" 
     #       V7          V8          V9         V10 
     #"character" "character" "character" "character" 
    

    And the correct class can be found from

     sapply(df[cat], class)
     #    V1       V2       V3       V4       V5       V6       V7       V8 
     #"factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" 
     #    V9      V10 
     #"factor" "factor" 
    

    The class of the columns changed from 'factor' to 'character' because apply first converts the data.frame to a matrix, and a matrix can hold only a single class. If there is any non-numeric column, the whole matrix is converted to 'character'. You can use apply on a numeric matrix, as the return class will also be 'numeric'. In general, when the columns have mixed classes, it is better to use lapply/vapply; sapply is also useful when you want a logical vector or similar.
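
The coercion can be observed directly with as.matrix, which is the conversion apply performs on a data.frame before looping over it. A small sketch, using a hypothetical two-column data frame `d` that is not part of the question:

```r
# Hypothetical data: two factor columns.
d <- data.frame(
  g1 = factor(c("x", "x", "x")),
  g2 = factor(c("x", "y", "x"))
)

m <- as.matrix(d)      # the conversion apply() performs internally
typeof(m)              # "character": the factor attributes are gone

apply(d, 2, nlevels)   # nlevels() of a character vector is 0
sapply(d, nlevels)     # columns stay factors, so levels are counted

# vapply is a stricter alternative that checks the return type.
lev <- vapply(d, nlevels, integer(1))
lev == 1
```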

    data

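    set.seed(64)
    # stringsAsFactors=TRUE is needed in R >= 4.0.0, where character columns are no longer converted to factors by default
    df <- as.data.frame(matrix(sample(LETTERS[1:3], 3*10, replace=TRUE), ncol=10), stringsAsFactors=TRUE)
    df <- cbind(df, V11=1:3)
    cat <- sapply(df, is.factor)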
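
Putting the pieces together, here is a sketch of the whole workflow from the question: refresh the factor columns with lapply (rather than apply), then pick the ones left with a single level. The data frame `dat` and its column names are hypothetical, and droplevels (base R) is used here as one possible alternative to calling factor() again on each column.

```r
# Hypothetical data: after subsetting, 'grp' keeps an unused level "b".
dat <- data.frame(
  grp  = factor(c("a", "a", "b")),
  size = factor(c("s", "m", "l")),
  val  = c(1.5, 2.5, 3.5)
)
dat <- dat[dat$grp == "a", ]   # rows with grp == "b" are removed
nlevels(dat$grp)               # still 2: the level "b" is now unused

fac <- vapply(dat, is.factor, logical(1))
dat[fac] <- lapply(dat[fac], droplevels)   # refresh levels column by column

one <- vapply(dat[fac], nlevels, integer(1)) == 1
names(dat[fac][one])           # "grp"
```

Because lapply works on the data.frame as a list, each column keeps its factor class, which is what the apply-based attempt loses.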