
Factors and Dummy Variables in R


I am new to data analytics and am learning R. I have a few very basic questions that I am not clear about, and I hope to find some help here. Please bear with me, I am still learning.

I wrote a small function to perform basic exploratory analysis on a data set with 9 variables, of which 8 are of integer/numeric type and 1 is a factor. The function looks like this:

  out <- function(x) 
  {
    c <- class(x)
    na.len <- length(which(is.na(x)))
    m <- mean(x, na.rm = TRUE)
    s <- sd(x, na.rm = TRUE)
    uc <- m+3*s
    lc <- m-3*s
    return(c(classofvar = c, noofNA = na.len, mean=m, stdev=s, UpperCap = uc, LowerCap = lc))
  }

I apply it to the data set using:

stats <- apply(train, 2, FUN = out)

But the output shows the class of every variable as character and all the means as NA. After some head-scratching, I figured out that the problem is due to the factor variable. I converted it to numeric using this:

train$MonthlyIncome=as.numeric(as.character(train$MonthlyIncome))

It worked fine. But I am confused: if I use the above function without looking at the data set first, it won't work. How can I handle this situation?

When should I consider creating dummy variables?

Thank you in advance, and I hope the questions are not too silly!


Solution

  • Note that c() returns a vector, and all elements of a vector must be of the same class. If the elements have different classes, c() coerces them to the simplest class that can hold all of the information. E.g. numeric and integer result in numeric; character and integer result in character. In your function the class name is a character string, so every other value in the returned vector is converted to character as well.

    Use a list or a data.frame if you need different classes.
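
    The coercion rule can be checked directly in the console. This is a small sketch; the list `mixed` and its values are made up for illustration.

    ```r
    # c() coerces its arguments to one common type.
    class(c(1L, 2.5))   # "numeric"
    class(c(1L, "a"))   # "character"

    # A list keeps each element's own class (hypothetical example values).
    mixed <- list(classofvar = "integer", noofNA = 0L, mean = 2.5)
    sapply(mixed, class)
    ```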

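    out <- function(x)
    {
      c <- class(x)                      # class of the variable
      na.len <- length(which(is.na(x)))  # number of missing values
      m <- mean(x, na.rm = TRUE)
      s <- sd(x, na.rm = TRUE)
      uc <- m+3*s                        # upper cap: mean + 3 sd
      lc <- m-3*s                        # lower cap: mean - 3 sd
      # a data.frame keeps the class of each column, unlike c()
      return(data.frame(classofvar = c, noofNA = na.len, mean=m, stdev=s, UpperCap = uc, LowerCap = lc))
    }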
    

    sum(is.na(x)) is faster than length(which(is.na(x))) and gives the same count.
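
    A quick check that both expressions count the same thing, using a made-up vector `x`:

    ```r
    x <- c(1, NA, 3, NA, 5)    # hypothetical example data

    length(which(is.na(x)))    # 2
    sum(is.na(x))              # 2, because TRUE counts as 1 and FALSE as 0
    ```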

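    Use lapply to run the function on each variable. Unlike apply, lapply does not convert the data frame to a matrix first, so each column keeps its own class. Use do.call with rbind to combine the resulting data frames into one.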

    stats <- do.call(
      rbind,
      lapply(train, out)
    )
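
    To see why apply(train, 2, ...) reports every class as character, and one way to make the function tolerate non-numeric columns without inspecting the data first, here is a minimal sketch. The data frame `train_demo` and the function `out_safe` are hypothetical names invented for this example, and returning NA for non-numeric columns is only one possible choice.

    ```r
    # Hypothetical stand-in for the real `train` data (values are made up).
    train_demo <- data.frame(
      age    = c(25L, 40L, NA, 31L),
      income = c(1000.5, 2500, 1800, NA),
      group  = factor(c("a", "b", "a", "b"))
    )

    # apply() converts the data frame to a matrix first; with a factor column
    # present, every column reaches the function as character.
    apply(train_demo, 2, class)

    # lapply()/sapply() hand over each column unchanged.
    sapply(train_demo, class)

    # Sketch of a summary function that returns NA statistics for
    # non-numeric columns instead of failing or warning.
    out_safe <- function(x) {
      num <- is.numeric(x)
      m <- if (num) mean(x, na.rm = TRUE) else NA_real_
      s <- if (num) sd(x, na.rm = TRUE) else NA_real_
      data.frame(
        classofvar = class(x)[1],
        noofNA     = sum(is.na(x)),
        mean       = m,
        stdev      = s,
        UpperCap   = m + 3 * s,
        LowerCap   = m - 3 * s
      )
    }

    stats_demo <- do.call(rbind, lapply(train_demo, out_safe))
    stats_demo
    ```

    With this guard, a factor column simply gets NA in the numeric summary columns, so the function can be run on a data set before any type conversion.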