
How to elegantly recode multiple columns containing multiple values


I have a dataframe with a mix of continuous and categorical data.

df<- data.frame(gender=c("male","female","transgender"),
                    education=c("high-school","grad-school","home-school"),
                    smoke=c("yes","no","prefer not tell"))
> print(df)
       gender   education           smoke
1        male high-school             yes
2      female grad-school              no
3 transgender home-school prefer not tell
> str(df)
'data.frame':   3 obs. of  3 variables:
 $ gender   : chr  "male" "female" "transgender"
 $ education: chr  "high-school" "grad-school" "home-school"
 $ smoke    : chr  "yes" "no" "prefer not tell"

I'm trying to recode the categorical columns to a nominal format. My current approach is tedious: first I have to convert all character variables to factors, and then I recode each column by hand with revalue().

# Coerce all character formats to Factors
# (assigning in place keeps any non-character columns in df)
df[sapply(df, is.character)] <-
  lapply(df[sapply(df, is.character)], as.factor)

library(plyr)
df$gender<- revalue(df$gender,c("male"="1","female"="2","transgender"="3"))
df$education<- revalue(df$education,c("high-school"="1","grad-school"="2","home-school"="3"))
df$smoke<- revalue(df$smoke,c("yes"="1","no"="2","prefer not tell"="3"))
> print(df)
  gender education smoke
1      1         1     1
2      2         2     2
3      3         3     3

Is there a more elegant way to approach this problem? Something in the tidyverse style would be helpful. I have already seen somewhat similar questions (1, 2, 3). The issue with those solutions is that they are either not relevant to what I need or they use base R approaches like lapply() or sapply(), which I find difficult to interpret. I would also like to know if there is an elegant, tidyverse-style way to convert all character variables to factors.


Solution

  • Try this. Note that mutate() and across() are used twice: the first call converts each column to a factor whose levels follow the order in which the values first appear (unique()), and the second call uses as.numeric() to extract the underlying integer codes. Here is the code:

    library(tidyverse)
    #Code
    df %>% mutate(across(gender:smoke,~factor(.,levels = unique(.)))) %>%
      mutate(across(gender:smoke,~as.numeric(.)))
    

    Output:

      gender education smoke
    1      1         1     1
    2      2         2     2
    3      3         3     3
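
    Why levels = unique(.) matters can be seen on a single vector. This is a minimal base R sketch (x is an illustrative vector, not part of the original data frame): without explicit levels, factor() sorts the levels alphabetically, so the codes no longer follow the order of appearance.

    ```r
    # Illustrative vector with the same values as df$gender
    x <- c("male", "female", "transgender")

    # Default: levels are sorted alphabetically (female, male, transgender)
    default_codes <- as.numeric(factor(x))

    # Explicit levels: codes follow the order of first appearance
    ordered_codes <- as.numeric(factor(x, levels = unique(x)))

    print(default_codes)  # 2 1 3
    print(ordered_codes)  # 1 2 3
    ```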
    

    To see which number each original value is mapped to, you can build a lookup table:

    #Code 2
    df %>% summarise_all(.funs = unique) %>% pivot_longer(everything()) %>%
      arrange(name) %>%
      group_by(name) %>% mutate(Newval=1:n())
    

    Output:

    # A tibble: 9 x 3
    # Groups:   name [3]
      name      value           Newval
      <chr>     <fct>            <int>
    1 education high-school          1
    2 education grad-school          2
    3 education home-school          3
    4 gender    male                 1
    5 gender    female               2
    6 gender    transgender          3
    7 smoke     yes                  1
    8 smoke     no                   2
    9 smoke     prefer not tell      3
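
    In dplyr 1.1.0 and later, a summarise() call that returns several rows per group is deprecated, so Code 2 may emit a warning there. A possible alternative sketch builds the same kind of lookup table with pivot_longer() and distinct(). The name lookup is only illustrative, and the data frame is recreated here with character columns so the block runs on its own.

    ```r
    library(dplyr)
    library(tidyr)

    df <- data.frame(gender = c("male", "female", "transgender"),
                     education = c("high-school", "grad-school", "home-school"),
                     smoke = c("yes", "no", "prefer not tell"),
                     stringsAsFactors = FALSE)

    lookup <- df %>%
      pivot_longer(everything()) %>%      # one row per (column, value) pair
      distinct(name, value) %>%           # keep the first occurrence of each pair
      group_by(name) %>%
      mutate(Newval = row_number()) %>%   # number values by order of appearance
      ungroup() %>%
      arrange(name, Newval)

    print(lookup)
    ```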
    

    Or, for more control, join that lookup table back onto the data:

    #Code 3
    df %>% mutate(id=1:n()) %>% pivot_longer(-id) %>%
      left_join(df %>% summarise_all(.funs = unique) %>% pivot_longer(everything()) %>%
                  arrange(name) %>%
                  group_by(name) %>% mutate(Newval=1:n()) %>% ungroup()) %>%
      select(-value) %>%
      pivot_wider(names_from = name,values_from=Newval) %>%
      select(-id)
    

    Output:

    # A tibble: 3 x 3
      gender education smoke
       <int>     <int> <int>
    1      1         1     1
    2      2         2     2
    3      3         3     3
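
    If the goal is to choose the codes yourself, as in the revalue() calls from the question, one option is to state the order per column and look each value up with match(). This is a sketch under assumptions: level_map is a hypothetical helper list, and any value missing from it would become NA.

    ```r
    library(dplyr)

    df <- data.frame(gender = c("male", "female", "transgender"),
                     education = c("high-school", "grad-school", "home-school"),
                     smoke = c("yes", "no", "prefer not tell"),
                     stringsAsFactors = FALSE)

    # Hypothetical mapping: the position of a value in its vector is its code
    level_map <- list(
      gender    = c("male", "female", "transgender"),
      education = c("high-school", "grad-school", "home-school"),
      smoke     = c("yes", "no", "prefer not tell")
    )

    recoded <- df %>%
      mutate(across(all_of(names(level_map)),
                    # cur_column() gives the name of the column being processed
                    ~ match(.x, level_map[[cur_column()]])))

    print(recoded)
    ```

    Reordering the entries in level_map changes the codes without touching the pipeline.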
    

    If your variables are of class character, you can use this pipeline to convert them to factors, reorder the levels by first appearance, and then make them numeric:

    #Code 4
    df %>% 
      mutate(across(gender:smoke,~as.factor(.))) %>%
      mutate(across(gender:smoke,~factor(.,levels = unique(.)))) %>%
      mutate(across(gender:smoke,~as.numeric(.)))
    

    Output:

      gender education smoke
    1      1         1     1
    2      2         2     2
    3      3         3     3
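
    When the data frame also holds continuous columns, selecting by type instead of by name may be more convenient. The sketch below uses where(is.character) inside across(). The age column is a made-up addition for illustration and is not in the original data.

    ```r
    library(dplyr)

    df <- data.frame(gender = c("male", "female", "transgender"),
                     education = c("high-school", "grad-school", "home-school"),
                     smoke = c("yes", "no", "prefer not tell"),
                     age = c(34, 27, 45),  # hypothetical continuous column
                     stringsAsFactors = FALSE)

    # Convert every character column to a factor; age is left untouched
    df_factor <- df %>%
      mutate(across(where(is.character), as.factor))

    # Or go straight to integer codes ordered by first appearance
    df_coded <- df %>%
      mutate(across(where(is.character),
                    ~ as.integer(factor(.x, levels = unique(.x)))))

    str(df_factor)
    print(df_coded)
    ```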