
Some confusion regarding defining factor variables


Until now I have defined factor variables in R like this:

q5_data$high <- ifelse(q5_data$totexp > median(q5_data$totexp), 1, 0)

However, I have noticed people using things such as:

factor(directions, levels= c("North", "East", "South", "West"))

Do I have to define a factor variable explicitly as a factor variable or will simply having a vector of 1's and 0's work?


Solution

  • The question is in fact two questions.

    1.

    In R, the creation of dummy variables is seldom, if ever, needed. R's modelling functions take care of that automatically. But if you want to dichotomise a numeric variable, in the question's example as values below or above the median, ifelse is just one of the ways of doing it.

    Here are two others (essentially the same approach). They take advantage of the fact that FALSE/TRUE are coded as the integers 0/1, and they coerce the logical values to a numeric class.

    set.seed(2021)
    x <- runif(10, 0, 100)
    
    y <- ifelse(x > median(x), 1, 0)
    z <- as.integer(x > median(x))
    identical(y, z)
    #[1] FALSE
    

    The result is FALSE because, although the values are equal, the objects' classes are not.

    class(y)
    #[1] "numeric"
    class(z)
    #[1] "integer"
    

    The solution is not to care about that, unless an identical result is needed.

    z2 <- as.numeric(x > median(x))
    identical(y, z2)
    #[1] TRUE
    

    To see why this is probably not needed (the regression functions will build the dummy columns on their own), run the following. Output omitted.

    model.matrix(~ x > median(x))
    
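    As a minimal sketch of the point above, the following fits the same regression twice, once with a hand-made 0/1 dummy and once with a two-level factor. The response resp is simulated purely for illustration and is not part of the question's data; the names high_num, high_fac, fit_num and fit_fac are likewise hypothetical.

    ```r
    set.seed(2021)
    x <- runif(10, 0, 100)

    # Hypothetical response, simulated only for this illustration
    resp <- 2 + 3 * (x > median(x)) + rnorm(10)

    high_num <- as.numeric(x > median(x))  # 0/1 dummy built by hand
    high_fac <- factor(x > median(x))      # factor with levels "FALSE" and "TRUE"

    fit_num <- lm(resp ~ high_num)
    fit_fac <- lm(resp ~ high_fac)

    # Compare the estimates, ignoring the coefficient names
    all.equal(unname(coef(fit_num)), unname(coef(fit_fac)))
    ```

    With R's default treatment contrasts, both fits should give the same intercept and slope; only the coefficient names differ. So for a two-level variable, a vector of 0's and 1's and an explicit factor lead to the same model.
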

    2.

    A different problem is to bin the data. If you want to create a factor of small, medium and large out of a numeric variable, functions like cut, .bincode or findInterval can be of use.

    i <- findInterval(x, c(0, 33.33, 66.67, Inf))
    levels <- c("Small", "Medium", "Large")
    f <- factor(levels[i], levels = levels)
    
    f
    # [1] Medium Large  Large  Medium Medium Large  Medium Small  Large 
    #[10] Large 
    #Levels: Small Medium Large
    

    Why have I explicitly set the factor levels? Because by default R sorts the levels lexicographically: "Large" would come first, then "Medium", then "Small", so "Small" would end up as the highest level. Setting the factor levels manually gives full control over the result.
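
    The effect of the default ordering can be checked directly. The sketch below reuses the same x and break points, compares the levels with and without the levels argument, and also shows cut as an alternative to findInterval. The names breaks, size_labels, f_default and g are introduced here only for illustration.

    ```r
    set.seed(2021)
    x <- runif(10, 0, 100)

    breaks <- c(0, 33.33, 66.67, Inf)
    size_labels <- c("Small", "Medium", "Large")

    i <- findInterval(x, breaks)

    # Without `levels`, the levels are the sorted unique values
    f_default <- factor(size_labels[i])
    levels(f_default)

    # With explicit levels, the order is the one supplied
    f <- factor(size_labels[i], levels = size_labels)
    levels(f)
    #[1] "Small"  "Medium" "Large"

    # cut() with right = FALSE uses left-closed intervals [a, b),
    # which matches findInterval()'s default behaviour
    g <- cut(x, breaks = breaks, labels = size_labels, right = FALSE)
    all(as.character(f) == as.character(g))
    ```

    Passing labels to cut sets the level order in the same step, which may be more convenient when the bins are already known.
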