
Auto cleaning functions before modelling in R


Perhaps this is a dumb question, but I am a new convert from SAS and am still finding my way around. What is the easiest way to clean a data set before running models? E.g., I have a dataset with 100 variables. How can I remove character/factor variables with fewer than 2 levels before running a model? This seems to happen on the fly in SAS, and I find it a pain to drop variables manually in R before modelling. Surely there is a better way. Thanks in advance.


Solution

  • You could try the following (a modification of @Richard Scriven's answer):

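    # TRUE for factor columns with fewer than 2 levels
    indx <- sapply(dat, function(x) is.factor(x) && nlevels(x) < 2)
    # drop=FALSE keeps a data frame even if only one column remains
    dat1 <- dat[, !indx, drop=FALSE]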
    head(dat1)
    #       Col1 Col3
    #1  1.3709584    B
    #2 -0.5646982    B
    #3  0.3631284    B
    #4  0.6328626    D
    #5  0.4042683    A
    #6 -0.1061245    D
    

    If you have both character and factor columns and want to remove those with fewer than 2 unique levels/values, the factor-only check is not enough. For example, convert Col4 to character:

    dat$Col4 <- as.character(dat$Col4)
    

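    Re-running the code above now keeps Col4 (which is wrong), because a character column fails the is.factor() test:

     indx <- sapply(dat, function(x) is.factor(x) && nlevels(x) < 2)
     head(dat[,!indx],2)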
     #        Col1 Col3 Col4
     #1  1.3709584    B  Yes
     #2 -0.5646982    B  Yes
    

    Here, you could do:

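    # TRUE for any non-numeric column (factor or character) with fewer than 2 unique values
    indx1 <- sapply(dat, function(x) !is.numeric(x) && length(unique(x)) < 2)
    head(dat[, !indx1, drop=FALSE])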
      #       Col1 Col3
      #1  1.3709584    B
      #2 -0.5646982    B
      #3  0.3631284    B
      #4  0.6328626    D
      #5  0.4042683    A
      #6 -0.1061245    D
    

    data

    set.seed(42)
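    # stringsAsFactors=TRUE is needed in R >= 4.0.0, where strings stay character by default
    dat <- data.frame(Col1=rnorm(25), Col2=LETTERS[1],
        Col3=sample(LETTERS[1:5], 25, replace=TRUE), Col4="Yes",
        stringsAsFactors=TRUE)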
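
If this cleaning step has to run before every model, the check above could be wrapped in a small function. The following is only a sketch under stated assumptions: `drop_constant_categoricals` and `min_levels` are hypothetical names, not taken from the answer or from any package. It counts distinct non-missing values rather than declared factor levels, so a factor with unused levels is also treated as constant. That is a design choice made here, not something the answer specifies.

```r
# Hypothetical helper (not from the original answer): drop factor/character
# columns that carry fewer than `min_levels` distinct non-missing values.
drop_constant_categoricals <- function(df, min_levels = 2) {
  is_degenerate <- vapply(
    df,
    function(x) {
      (is.factor(x) || is.character(x)) &&
        length(unique(x[!is.na(x)])) < min_levels
    },
    logical(1)
  )
  # drop = FALSE keeps the result a data frame even if one column survives
  df[, !is_degenerate, drop = FALSE]
}

# Small deterministic example mirroring the column layout used above
dat <- data.frame(
  Col1 = seq_len(25) / 10,                 # numeric, always kept
  Col2 = factor(rep("A", 25)),             # factor with one level
  Col3 = factor(rep(LETTERS[1:5], 5)),     # factor with five levels
  Col4 = rep("Yes", 25),                   # constant character column
  stringsAsFactors = FALSE
)

cleaned <- drop_constant_categoricals(dat)
names(cleaned)
```

Numeric columns are left alone on purpose, matching the `!is.numeric(x)` condition in the answer; a constant numeric column would need a separate check.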