I have a very messy dataset where researchers did not match levels of data across sessions. In one session, a '[digit]: ' or '[digit] : ' was added.
I created a dataframe of session 3, 6, and 10 visits from the SWAN dataset. You can download what I'm working with here: LINK
Here's an example of the levels:
levels(dat$ABLECLM)
[1] "(1) Always" "(2) Almost Always"
[3] "(3) Sometimes" "(4) Almost Never"
[5] "(5) Never" "(6) No Intercourse In Last 6 Mons"
[7] "(1) 1 : Always" "(2) 2 : Almost always"
[9] "(3) 3 : Sometimes" "(4) 4 : Almost never"
[11] "(5) 5 : Never" "(6) 6 : No intercourse in last 6 months"
[13] "(2) Almost always" "(4) Almost never"
[15] "(6) No intercourse in last 6 months"
I wrote this function to use in an apply function:
match_levels <- function(column){
if(is.factor(column)){
levels(column) <- sapply(column, tolower)
levels(column) <- sapply(column,function(x) sub('\\d: |\\d : ', '', levels(x)))
return(column)
}else{
return(column)
}
}
It works on a single column, but when I try and apply to each column I have:
dat <- read.csv(<data_link>)
df <- data.frame(apply(dat,2, function(x) x=match_levels(x)))
I get this:
levels(as.factor(df$ABLECLM))
[1] "(1) 1 : Always"
[2] "(1) Always"
[3] "(2) 2 : Almost always"
[4] "(2) Almost always"
[5] "(2) Almost Always"
[6] "(3) 3 : Sometimes"
[7] "(3) Sometimes"
[8] "(4) 4 : Almost never"
[9] "(4) Almost never"
[10] "(4) Almost Never"
[11] "(5) 5 : Never"
[12] "(5) Never"
You can use your original logic of updating the levels of the factor, rather than the value of the variable, which requires factoring again.
new_levels <- function(vec) {
if (is.factor(vec)) {
lvls <- gsub('\\d: |\\d : ', '', tolower(levels(vec)))
dups <- which(duplicated(substr(lvls,1,3)))
lvls[dups] <- lvls[dups-1]
lvls[dups] <- lvls[dups-1]
levels(vec) <- lvls
return(vec)
} else {
return(vec)
}
}
df[] <- lapply(df, new_levels)
> microbenchmark::microbenchmark(lapply(df, new_levels),lapply(dat, match_levels))
Unit: milliseconds
expr min lq mean median uq max
lapply(df, new_levels) 4.875398 5.13453 6.654272 5.737605 6.568733 80.17889
lapply(dat, match_levels) 516.539473 532.93423 539.665595 536.944488 541.752497 615.87178