conditional sapply to change levels of DF factors


I have a very messy dataset where the researchers did not keep factor levels consistent across sessions. In one session, a '[digit]: ' or '[digit] : ' was added to each label.

I created a dataframe of session 3, 6, and 10 visits from the SWAN dataset. You can download what I'm working with here: LINK

Here's an example of the levels:

levels(dat$ABLECLM)
 [1] "(1) Always"                              "(2) Almost Always"                      
 [3] "(3) Sometimes"                           "(4) Almost Never"                       
 [5] "(5) Never"                               "(6) No Intercourse In Last 6 Mons"      
 [7] "(1) 1 : Always"                          "(2) 2 : Almost always"                  
 [9] "(3) 3 : Sometimes"                       "(4) 4 : Almost never"                   
[11] "(5) 5 : Never"                           "(6) 6 : No intercourse in last 6 months"
[13] "(2) Almost always"                       "(4) Almost never"                       
[15] "(6) No intercourse in last 6 months"

I wrote this function to use in an apply function:

match_levels <- function(column){
    if(is.factor(column)){
        levels(column) <- sapply(column, tolower)
        levels(column) <- sapply(column,function(x) sub('\\d: |\\d : ', '', levels(x)))
        return(column)
    }else{
        return(column)
    }
}

It works on a single column, but when I try to apply it to every column:

dat <- read.csv(<data_link>)
df <- data.frame(apply(dat,2, function(x) x=match_levels(x)))

I get this:

levels(as.factor(df$ABLECLM))
 [1] "(1) 1 : Always"
 [2] "(1) Always"
 [3] "(2) 2 : Almost always"
 [4] "(2) Almost always"
 [5] "(2) Almost Always"
 [6] "(3) 3 : Sometimes"
 [7] "(3) Sometimes"
 [8] "(4) 4 : Almost never"
 [9] "(4) Almost never"
[10] "(4) Almost Never"
[11] "(5) 5 : Never"
[12] "(5) Never"

Solution
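
A likely reason the `apply` call in the question leaves the levels untouched: `apply` coerces a data frame to a matrix first, so each column reaches the function as a character vector and the `is.factor` branch never runs. A minimal sketch on a made-up one-column data frame (`toy` and its values are illustrative, not taken from the SWAN data):

```r
# Toy data frame; the column name and labels are illustrative only.
toy <- data.frame(x = factor(c("(1) Always", "(1) 1 : Always")))

# apply() goes through as.matrix(), so the function sees character vectors.
via_apply <- apply(toy, 2, is.factor)

# vapply()/lapply() walk the columns as stored, so factors stay factors.
via_lapply <- vapply(toy, is.factor, logical(1))

print(via_apply)
print(via_lapply)
```

Iterating over the columns with `lapply` therefore lets a factor-aware function such as `match_levels` actually see factors.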

  • You can keep your original logic of updating the levels of the factor rather than the values of the variable (changing the values would require re-creating the factor).

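    new_levels <- function(vec) {
      if (is.factor(vec)) {
        # Lower-case the labels and strip the stray "n: " / "n : " fragment
        lvls <- gsub('\\d: |\\d : ', '', tolower(levels(vec)))
        # Positions whose "(n)" prefix has already been seen
        dups <- which(duplicated(substr(lvls,1,3)))
        # Copy the preceding label onto each duplicate (this relies on levels
        # with the same prefix being adjacent, e.g. sorted); done twice so
        # that runs of three labels with the same prefix also collapse
        lvls[dups] <- lvls[dups-1]
        lvls[dups] <- lvls[dups-1]
        # Assigning repeated labels merges the corresponding levels
        levels(vec) <- lvls
        return(vec)
      } else {
        return(vec)
      }
    }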
    
    df[] <- lapply(df, new_levels)
    
    > microbenchmark::microbenchmark(lapply(df, new_levels),lapply(dat, match_levels))
    Unit: milliseconds
                          expr        min        lq       mean     median         uq       max
        lapply(df, new_levels)   4.875398   5.13453   6.654272   5.737605   6.568733  80.17889
     lapply(dat, match_levels) 516.539473 532.93423 539.665595 536.944488 541.752497 615.87178
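
The `new_levels` function above copies each duplicate from the level immediately before it, which appears to rely on levels with the same "(n)" prefix sitting next to each other, as they do in the alphabetically sorted `df`. If the levels are not sorted, one possible variant is to look up the first level with the same prefix instead. This is only a sketch: `merge_levels_by_code` is a hypothetical name, and it assumes, like the code above, that the first three characters identify a level.

```r
# Hypothetical helper: collapse levels that share the same "(n)" prefix,
# regardless of the order in which the levels are stored.
merge_levels_by_code <- function(vec) {
  if (!is.factor(vec)) return(vec)
  # Same cleaning step as new_levels: lower-case, drop "n: " / "n : "
  lvls <- gsub("\\d: |\\d : ", "", tolower(levels(vec)))
  # Assumption: the first three characters, e.g. "(1)", identify a level
  code <- substr(lvls, 1, 3)
  # match(code, code) returns the index of the first level with each code,
  # so every later spelling is replaced by the first one encountered
  levels(vec) <- lvls[match(code, code)]
  vec
}

# Illustrative labels in an unsorted order
labels <- c("(1) Always",
            "(6) No Intercourse In Last 6 Mons",
            "(1) 1 : Always",
            "(6) 6 : No intercourse in last 6 months")
x <- factor(labels, levels = labels)
cleaned <- merge_levels_by_code(x)
print(levels(cleaned))
```

It can be applied in the same way, e.g. `dat[] <- lapply(dat, merge_levels_by_code)`. Which spelling survives ("mons" or "months") depends on which level comes first.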