I have a dataframe with a mix of continuous and categorical data.
df<- data.frame(gender=c("male","female","transgender"),
education=c("high-school","grad-school","home-school"),
smoke=c("yes","no","prefer not tell"))
> print(df)
gender education smoke
1 male high-school yes
2 female grad-school no
3 transgender home-school prefer not tell
> str(df)
'data.frame': 3 obs. of 3 variables:
$ gender : chr "male" "female" "transgender"
$ education: chr "high-school" "grad-school" "home-school"
$ smoke : chr "yes" "no" "prefer not tell"
I'm trying to recode the categorical columns as numeric codes. My current approach is quite tedious: first I convert all character variables to factors, then I recode each column by hand with revalue() from plyr,
# Coerce all character formats to Factors
is_chr <- sapply(df, is.character)  # which columns are character?
df[is_chr] <- lapply(df[is_chr], as.factor)  # convert only those; other columns are kept
library(plyr)
df$gender<- revalue(df$gender,c("male"="1","female"="2","transgender"="3"))
df$education<- revalue(df$education,c("high-school"="1","grad-school"="2","home-school"="3"))
df$smoke<- revalue(df$smoke,c("yes"="1","no"="2","prefer not tell"="3"))
> print(df)
gender education smoke
1 1 1 1
2 2 2 2
3 3 3 3
Is there a more elegant way to approach this problem? Something in tidyverse style would be helpful. I have already seen somewhat similar questions like 1, 2, 3. The issue with those solutions is that they are either not relevant to what I need or they use base R approaches like lapply() or sapply(), which are difficult for me to interpret. I would also like to know if there is an elegant tidyverse way to convert all character variables to factors.
Try this. Note that mutate() and across() are used twice: first to convert the values to factors whose levels follow the order in which they appear in each variable (unique()), and then to extract the numeric codes with as.numeric(). Here is the code:
library(tidyverse)
#Code
df %>% mutate(across(gender:smoke,~factor(.,levels = unique(.)))) %>%
mutate(across(gender:smoke,~as.numeric(.)))
Output:
gender education smoke
1 1 1 1
2 2 2 2
3 3 3 3
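The two mutate() calls above can also be collapsed into one. This is a minimal sketch, assuming dplyr >= 1.0.0 and that every character column should be recoded; where(is.character) is swapped in for the gender:smoke range, and the name df_coded is only illustrative:

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  gender    = c("male", "female", "transgender"),
  education = c("high-school", "grad-school", "home-school"),
  smoke     = c("yes", "no", "prefer not tell"),
  stringsAsFactors = FALSE
)

# factor() with levels = unique(.x) numbers the categories in order of
# first appearance; as.integer() then extracts those level codes.
df_coded <- df %>%
  mutate(across(where(is.character),
                ~ as.integer(factor(.x, levels = unique(.x)))))

print(df_coded)
```

Selecting with where(is.character) leaves any numeric columns untouched, which may matter if the real data mixes continuous and categorical variables.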
To see which new value is assigned to each category, you can use this:
#Code 2
df %>% summarise_all(.funs = unique) %>% pivot_longer(everything()) %>%
arrange(name) %>%
group_by(name) %>% mutate(Newval=1:n())
Output:
# A tibble: 9 x 3
# Groups: name [3]
name value Newval
<chr> <fct> <int>
1 education high-school 1
2 education grad-school 2
3 education home-school 3
4 gender male 1
5 gender female 2
6 gender transgender 3
7 smoke yes 1
8 smoke no 2
9 smoke prefer not tell 3
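A similar lookup table can be built without summarise_all(), which is superseded in current dplyr. This is a sketch, assuming dplyr >= 1.0.0, tidyr >= 1.0.0, and character columns; lookup is an illustrative name:

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
df <- data.frame(
  gender    = c("male", "female", "transgender"),
  education = c("high-school", "grad-school", "home-school"),
  smoke     = c("yes", "no", "prefer not tell"),
  stringsAsFactors = FALSE
)

lookup <- df %>%
  # One row per (column, value) pair, in original row order
  pivot_longer(everything(), names_to = "name", values_to = "value") %>%
  # Keep the first occurrence of each category within each column
  distinct(name, value) %>%
  group_by(name) %>%
  # Number the categories in order of first appearance
  mutate(Newval = row_number()) %>%
  ungroup() %>%
  arrange(name, Newval)

print(lookup)
```

Because distinct() keeps the first occurrence of each pair, the numbering follows the order of appearance, matching factor(., levels = unique(.)).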
Or maybe for more control:
#Code 3
df %>% mutate(id=1:n()) %>% pivot_longer(-id) %>%
left_join(df %>% summarise_all(.funs = unique) %>% pivot_longer(everything()) %>%
arrange(name) %>%
group_by(name) %>% mutate(Newval=1:n()) %>% ungroup()) %>%
select(-value) %>%
pivot_wider(names_from = name,values_from=Newval) %>%
select(-id)
Output:
# A tibble: 3 x 3
gender education smoke
<int> <int> <int>
1 1 1 1
2 2 2 2
3 3 3 3
If your variables are of class character, you can use this pipeline to convert them to factors, reorder the factor levels by appearance, and then make them numeric:
#Code 4
df %>%
mutate(across(gender:smoke,~as.factor(.))) %>%
mutate(across(gender:smoke,~factor(.,levels = unique(.)))) %>%
mutate(across(gender:smoke,~as.numeric(.)))
Output:
gender education smoke
1 1 1 1
2 2 2 2
3 3 3 3
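On the last part of the question (turning every character column into a factor in tidyverse style), here is a short sketch, assuming dplyr >= 1.0.0; df_fct is an illustrative name. Note that as.factor() sorts the levels alphabetically, so integer codes taken from it would differ from the order-of-appearance codes shown above.

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  gender    = c("male", "female", "transgender"),
  education = c("high-school", "grad-school", "home-school"),
  smoke     = c("yes", "no", "prefer not tell"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; other columns are left alone
df_fct <- df %>%
  mutate(across(where(is.character), as.factor))

str(df_fct)
levels(df_fct$gender)  # levels are sorted alphabetically by as.factor()
```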