Search code examples
rfactors

How to create a factor variable based on the levels of the same variable in a different data frame


I have two data frames: main_df is the master table. addl_df is a smaller table.

GOAL: to convert all character variables in addl_df as factors with the same levels as the character variables with the same names in main_df.

main_df <- data.frame(id=c(1, 2, 3, 4, 5), age=c(10, 20, 30, 40, 45), gender=c("F","F","M","M","F"), city=c("A","B","C","D","D"))
addl_df <- data.frame(id=c(7,8), age=c( 40, 45), gender=c("F","F"), city=c("C","D"))

Using the code below, city will be a factor variable with 2 levels ("C" and "D). What I want is a factor w/ 4 levels "A","B","C","D" with "C" has value of 3 (same as how it's defined in main_df).

Is it possible to do this in an automated way (instead of manually defining the variable one-by-one? Thank you!

main_df[sapply(main_df, is.character)] <- lapply(main_df[sapply(main_df, is.character)], as.factor) 
addl_df[sapply(addl_df, is.character)] <- lapply(addl_df[sapply(addl_df, is.character)], as.factor)

enter image description here


Solution

  • One option is to bind the datasets with bind_rows, while creating a data identifier ('grp'), convert the character columns to factor, do a group_split by the 'grp' into a list of data.frames, then set the names of the list with setNames and update the original objects with list2env

    library(dplyr)
    bind_rows(main_df, addl_df, .id = 'grp') %>% 
        mutate(across(where(is.character), factor)) %>%
        group_split(grp, .keep = FALSE) %>%
        setNames(c('main_df', 'addl_df')) %>%
        list2env(.GlobalEnv)
    

    -output

    > str(main_df)
    tibble [5 × 4] (S3: tbl_df/tbl/data.frame)
     $ id    : num [1:5] 1 2 3 4 5
     $ age   : num [1:5] 10 20 30 40 45
     $ gender: Factor w/ 2 levels "F","M": 1 1 2 2 1
     $ city  : Factor w/ 4 levels "A","B","C","D": 1 2 3 4 4
    > str(addl_df)
    tibble [2 × 4] (S3: tbl_df/tbl/data.frame)
     $ id    : num [1:2] 7 8
     $ age   : num [1:2] 40 45
     $ gender: Factor w/ 2 levels "F","M": 1 1
     $ city  : Factor w/ 4 levels "A","B","C","D": 3 4