I have a data frame containing a mixture of numeric and categorical variables (factors in R), such as the following:
df <- data.frame(
age = c(18, 29, 55, 90, 44),
sex = c("M", "F", "M", "M", "F"),
category = c("cat1", "cat1", "cat2", "cat2", "cat2"))
I need to run some statistical analysis on this data, but this analysis requires the input data to be in the form of a numeric matrix. I can of course convert sex and category into numeric variables by using something like df$sex <- ifelse(df$sex == "M", 1, 0), but this is very tedious to do manually when I have a lot of variables. Furthermore, the analysis will return a numeric matrix that I need to reconvert to the format of df with the original categories, and so I'll need to use something like x$sex <- ifelse(x$sex == 1, "M", "F") (here, x holds the return value of the analysis) to reconvert each variable back to the original data format, essentially undoing the first conversion. For simplicity, please assume all the categorical variables are non-ordinal (i.e. there is no order to the categories) and binary (only two factor levels).
So I think I need two functions that can do this automatically: fct2num and num2originalfct. fct2num will need to return the data frame as a matrix with my variables fct_var <- c("sex", "category") converted appropriately to numeric, but also some sort of dictionary dict of the mappings (e.g., sex = "F" : 0, "M" : 1) that I'll need to pass to num2originalfct along with the output of the analysis, so that it can be recoded back to the original categories. Any ideas or alternatives on how best to accomplish this? Something like this would work for fct2num for one-hot encoding, but I'm not sure how I would revert the encoding back to the original factors.
You may strip off the factor labels using as.integer and subtract 1.
> chr <- names(df)[sapply(df, is.character)] ## names of the character columns
> df[chr] <- lapply(df[chr], \(x) as.integer(as.factor(x)) - 1L)
> m <- as.matrix(df)
> m
     age sex category
[1,]  18   1        0
[2,]  29   0        0
[3,]  55   1        1
[4,]  90   1        1
[5,]  44   0        1
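The question also asks for a dictionary of the mappings. As a minimal sketch (the name dict is borrowed from the question, and treating every character column as categorical is an assumption), the label-to-code mapping implied by as.integer(as.factor(x)) - 1L could be recorded like this:

```r
# Same example data as in the question
df <- data.frame(
  age = c(18, 29, 55, 90, 44),
  sex = c("M", "F", "M", "M", "F"),
  category = c("cat1", "cat1", "cat2", "cat2", "cat2"))

# Assumption: every character column is a categorical variable
chr <- names(df)[sapply(df, is.character)]

# One named integer vector per column: names are labels, values are codes.
# as.factor() sorts the labels, so the first sorted label gets code 0.
dict <- lapply(df[chr], function(x) {
  lv <- levels(as.factor(x))
  setNames(seq_along(lv) - 1L, lv)
})

dict$sex       # F = 0, M = 1
dict$category  # cat1 = 0, cat2 = 1
```

Because the codes follow the sorted labels, "M" becomes 1 here only because "F" sorts first. If a particular coding matters, it is probably safer to set the levels explicitly with factor(x, levels = ...).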
Update
To convert back and forth, you may follow an approach like this.
> df <- type.convert(df, as.is=FALSE) ## convert character to factor
> fac <- names(df)[sapply(df, is.factor)] ## store names of the factor columns
> lev <- lapply(df[fac], attr, 'levels') ## store levels of the factors
> df[fac] <- lapply(df[fac], \(x) as.integer(x) - 1L) ## 0-based integer codes
> m <- as.matrix(df)
> m
     age sex category
[1,]  18   1        0
[2,]  29   0        0
[3,]  55   1        1
[4,]  90   1        1
[5,]  44   0        1
> ## do stuff with matrix
> df2 <- as.data.frame(m) ## convert back to data.frame
> df2[fac] <- Map(`levels<-`, lapply(df2[fac] + 1L, factor), lev) ## restore levels
> df2
  age sex category
1  18   M     cat1
2  29   F     cat1
3  55   M     cat2
4  90   M     cat2
5  44   F     cat2
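The round trip above could be wrapped into the two functions the question asks for. This is only a sketch under some assumptions: fct2num, num2originalfct, fct_var and dict are the names proposed in the question, not existing functions; the dictionary is simply the list of factor levels; and the analysis is assumed to return the codes essentially unchanged (0 or 1, up to rounding) under the same column names.

```r
# Hypothetical helper: encode the columns named in fct_var as 0-based codes.
# Returns the numeric matrix plus a dictionary (the levels of each factor).
fct2num <- function(df, fct_var) {
  df[fct_var] <- lapply(df[fct_var], as.factor)
  dict <- lapply(df[fct_var], levels)
  df[fct_var] <- lapply(df[fct_var], function(x) as.integer(x) - 1L)
  list(matrix = as.matrix(df), dict = dict)
}

# Hypothetical inverse: look each code up in the stored levels.
num2originalfct <- function(m, dict) {
  out <- as.data.frame(m)
  out[names(dict)] <- Map(
    # round() guards against small floating-point noise in the codes
    function(code, lv) factor(lv[round(code) + 1L], levels = lv),
    out[names(dict)], dict)
  out
}

df <- data.frame(
  age = c(18, 29, 55, 90, 44),
  sex = c("M", "F", "M", "M", "F"),
  category = c("cat1", "cat1", "cat2", "cat2", "cat2"))

enc <- fct2num(df, fct_var = c("sex", "category"))
x <- enc$matrix                          # stand-in for the analysis output
restored <- num2originalfct(x, enc$dict)
```

Indexing into the stored levels, rather than calling factor() on the codes and then overwriting the levels, should also keep the labels right when the returned matrix happens to contain only one of the two codes in a column.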
Data:
> dput(df)
structure(list(age = c(18, 29, 55, 90, 44), sex = c("M", "F", "M",
"M", "F"), category = c("cat1", "cat1", "cat2", "cat2", "cat2")), row.names = c(NA, -5L), class = "data.frame")