r dataframe data-manipulation

R functions to convert factors to numeric and then back to original factor levels


I have a data frame containing a mixture of numeric and categorical variables (factors in R), such as the following:

df <- data.frame(
                 age = c(18, 29, 55, 90, 44),
                 sex = c("M", "F", "M", "M", "F"),
                 category = c("cat1", "cat1", "cat2", "cat2", "cat2"))

I need to run some statistical analysis on this data but this analysis requires the input data to be in the form of a numeric matrix. I can of course convert sex and category into numeric variables by using something like df$sex <- ifelse(df$sex == "M", 1, 0) but this is very tedious to do manually when I have a lot of variables. Furthermore, the analysis will return a numeric matrix that I need to reconvert to the format of df with the original categories, and so I'll need to use something like x$sex <- ifelse(x$sex == 1, "M", "F") (here, x holds the return value of the analysis) to reconvert each variable back to the original data format, essentially undoing the first conversion. For simplicity please assume all the categorical variables are non-ordinal (i.e. there is no order to the categories) and binary (only two factor levels).

So I think I want two functions that do this automatically: fct2num and num2originalfct. fct2num would return the data frame as a matrix with my variables fct_var <- c("sex", "category") converted to numeric, along with some sort of dictionary dict of the mappings (e.g., sex = "F" : 0, "M" : 1). I would then pass dict to num2originalfct, together with the output of the analysis, to recode it back to the original categories. Any ideas or alternatives on how best to accomplish this? Something like this would work for fct2num for one-hot encoding, but I'm not sure how I would revert the encoding back to the original factors.


Solution

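  • You may strip off the factor labels by converting each character column to a factor, applying as.integer, and subtracting 1. Here chr is assumed to hold the names of the character columns.

    > chr <- names(df)[sapply(df, is.character)]  ## names of the character columns
    > df[chr] <- lapply(df[chr], \(x) as.integer(as.factor(x)) - 1L)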
    > m <- as.matrix(df)
    > m
         age sex category
    [1,]  18   1        0
    [2,]  29   0        0
    [3,]  55   1        1
    [4,]  90   1        1
    [5,]  44   0        1
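
The 0/1 codes above follow the order of the factor levels, which as.factor sorts alphabetically for character input, so it can help to inspect the mapping before relying on it. A minimal sketch in base R, assuming a standalone copy of the sex column (the names f, codes, and mapping are illustrative, not part of the original answer):

```r
sex <- c("M", "F", "M", "M", "F")

f <- as.factor(sex)            # levels are sorted: "F" first, then "M"
codes <- as.integer(f) - 1L    # zero-based integer codes, as in the answer

# Named vector showing which code each level received
mapping <- setNames(seq_along(levels(f)) - 1L, levels(f))
mapping
```

Here "F" maps to 0 and "M" to 1, matching the sex column of the matrix above.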
    

    Update

    To convert back and forth, you may follow an approach like this.

    > df <- type.convert(df, as.is=FALSE)  ## convert character to factor
    > fac <- names(df)[sapply(df, is.factor)]  ## store names of the factor columns
    > lev <- lapply(df[fac], attr, 'levels')  ## store levels of the factors
    > df[fac] <- lapply(df[fac], \(x) as.integer(x) - 1L)  ## zero-based integer codes
    > m <- as.matrix(df)
    > m
         age sex category
    [1,]  18   1        0
    [2,]  29   0        0
    [3,]  55   1        1
    [4,]  90   1        1
    [5,]  44   0        1
    > ## do stuff with matrix
    > df2 <- as.data.frame(m)  ## convert back to data.frame
    > df2[fac] <- Map(\(x, l) factor(l[x + 1], levels = l), df2[fac], lev) ## restore levels by indexing, safe even if a level is absent
    > df2
      age sex category
    1  18   M     cat1
    2  29   F     cat1
    3  55   M     cat2
    4  90   M     cat2
    5  44   F     cat2
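
The round trip above could be wrapped into the two functions the question asks for. The following is only a sketch under stated assumptions: fct2num and num2originalfct are the hypothetical names from the question, the list elements matrix and dict are invented here for illustration, and the analysis is assumed to return whole-number codes (0, 1, ...) in the encoded columns.

```r
# Hypothetical helper: encode the columns named in fct_var as zero-based
# integers and return the numeric matrix together with a level dictionary.
fct2num <- function(df, fct_var) {
  df[fct_var] <- lapply(df[fct_var], as.factor)
  dict <- lapply(df[fct_var], levels)           # e.g. sex -> c("F", "M")
  df[fct_var] <- lapply(df[fct_var], \(x) as.integer(x) - 1L)
  list(matrix = as.matrix(df), dict = dict)
}

# Hypothetical helper: map the codes in a numeric matrix back to the
# original labels using the dictionary returned by fct2num().
num2originalfct <- function(m, dict) {
  out <- as.data.frame(m)
  for (v in names(dict)) {
    # Index into the stored levels (code 0 -> first level, 1 -> second)
    out[[v]] <- factor(dict[[v]][out[[v]] + 1], levels = dict[[v]])
  }
  out
}

df <- data.frame(
  age = c(18, 29, 55, 90, 44),
  sex = c("M", "F", "M", "M", "F"),
  category = c("cat1", "cat1", "cat2", "cat2", "cat2"))

enc <- fct2num(df, fct_var = c("sex", "category"))
## ... run the analysis on enc$matrix ...
restored <- num2originalfct(enc$matrix, enc$dict)
```

Keeping the dictionary next to the matrix means the decoding step never has to guess the level order, and indexing into the stored levels keeps the mapping correct even when one category is absent from the analysis output.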
    

    Data:

    > df <- data.frame(age = c(18, 29, 55, 90, 44),
    +                  sex = c("M", "F", "M", "M", "F"),
    +                  category = c("cat1", "cat1", "cat2", "cat2", "cat2"))