When defining factor variables in R I have defined them as such up till now:
q5_data$high <- ifelse(q5_data$totexp > median(q5_data$totexp), 1, 0)
However I noticed people using things such as:
factor(directions, levels= c("North", "East", "South", "West"))
Do I have to define a factor variable explicitly as a factor variable or will simply having a vector of 1's and 0's work?
The question is in fact two questions.
In R, the creation of dummy variables is seldom, if ever, needed. R's modelling functions take care of that automatically. But if you want to dichotomise a numeric variable, as in the question's example where values are split at the median, ifelse is just one of the ways of doing it.
Here are two others (essentially the same approach). They take advantage of the fact that FALSE/TRUE are coded as the integers 0/1 and coerce the logical values to a numeric class.
set.seed(2021)
x <- runif(10, 0, 100)
y <- ifelse(x > median(x), 1, 0)
z <- as.integer(x > median(x))
identical(y, z)
#[1] FALSE
The result is FALSE because, though the values are equal, the objects' classes are not.
class(y)
#[1] "numeric"
class(z)
#[1] "integer"
The solution is not to care about that unless an identical result is needed.
z2 <- as.numeric(x > median(x))
identical(y, z2)
#[1] TRUE
To see why this is probably not needed, since the regression functions will create the dummy variable on their own, run the following. Output omitted.
model.matrix(~ x > median(x))
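As a rough illustration of the point that modelling functions do the dummy coding themselves, the sketch below fits the same regression twice: once with an explicit 0/1 vector and once with a factor. The response resp is made up purely for this example, and the names high_num, high_fac, fit_num and fit_fac are hypothetical.

```r
set.seed(2021)
x <- runif(10, 0, 100)
resp <- rnorm(10)                        # hypothetical response, for illustration only

high_num <- as.numeric(x > median(x))    # explicit 0/1 dummy
high_fac <- factor(x > median(x))        # factor with levels "FALSE" and "TRUE"

fit_num <- lm(resp ~ high_num)
fit_fac <- lm(resp ~ high_fac)

# The estimates agree; only the coefficient names differ
all.equal(unname(coef(fit_num)), unname(coef(fit_fac)))
```

So for a two-level variable either form works in a model formula. The factor becomes more useful with three or more categories, where a single numeric code would wrongly be treated as a quantity.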
A different problem is to bin the data. If you want to create a factor of small, medium and large out of a numeric variable, functions like cut, .bincode or findInterval can be of use.
i <- findInterval(x, c(0, 33.33, 66.67, Inf))
levels <- c("Small", "Medium", "Large")
f <- factor(levels[i], levels = levels)
f
# [1] Medium Large Large Medium Medium Large Medium Small Large
#[10] Large
#Levels: Small Medium Large
Why have I explicitly set the factor levels? Because R defaults to the lexicographic order: "Large" would be the first level, then "Medium", and "Small" would be the last level, with the greatest integer code. Assigning the factor levels manually gives full control over the result.
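The effect of the level order can be checked directly. The following is a minimal sketch reusing the same x and breakpoints; the names size_labels, f_default, f_manual and f_cut are hypothetical. It also shows cut as a one-step alternative, with right = FALSE so that its intervals are left-closed like those of findInterval.

```r
set.seed(2021)
x <- runif(10, 0, 100)
size_labels <- c("Small", "Medium", "Large")
i <- findInterval(x, c(0, 33.33, 66.67, Inf))

f_default <- factor(size_labels[i])                        # levels sorted alphabetically
f_manual  <- factor(size_labels[i], levels = size_labels)  # levels in the intended order

levels(f_default)   # alphabetical, so "Large" comes before "Small"
levels(f_manual)    # "Small" "Medium" "Large"

# cut() bins and labels in one step
f_cut <- cut(x, breaks = c(0, 33.33, 66.67, Inf),
             labels = size_labels, right = FALSE)
all(as.character(f_cut) == as.character(f_manual))
```

The labels attached to each observation are the same in all three factors. What changes is the order of the levels, which affects things like the reference category in a model and the order of groups in tables and plots.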