I have a dataframe in R:
df <- data.frame(CRP = c("10", "2", "3", NA, "<4", ">5"))
I want to create a numeric column that replaces the strings into numeric values. Additionally, whenever it find a string starting with “<“ or “>”, it should look in the next letter, i.e. number, and impute the string with the median of the observations above or below that number. For example, “<4” should be replaced with median(c(2,3)). If there are no values below or above the threshold,NA the value.The remaining NAs should remain as it is currently now.
Desired output:
df = data.frame(c(10,2,3,NA,median(c(2,3)),median(10))
I have not managed to find a solution that avoids pre specifying functions a priori. I also want to avoid for loops. Ideally I would like to apply this into a data frame of millions of rows where multiple columns have that issue.
For multiple variables, you can try the following code, which uses a for loop only for the unique observations that contain "<" or ">" and then replaces all values that satisfy that condition.
df <- structure(list(CRP = c("10", "2", "3", NA, "<4", ">5"), CRP2 = c("10",
"12", "<5", "NA", ">5", "5")), class = "data.frame", row.names = c(NA,
-6L))
imputed <- paste0(names(df), "_imputed")
df[imputed] <- sapply(names(df), \(var) {
x <- df[,var]
num <- suppressWarnings(as.numeric(x))
for(s in unique(grep('<|>', x, value=TRUE))) {
x[which(x==s)] <- eval(parse(text=paste0("median(num[num", s, "], na.rm=TRUE)")))
}
suppressWarnings(as.numeric(x))
})
df
CRP CRP2 CRP_imputed CRP2_imputed
1 10 10 10.0 10
2 2 12 2.0 12
3 3 <5 3.0 NA
4 <NA> NA NA NA
5 <4 >5 2.5 11
6 >5 5 10.0 5
It's quick even for a dataset containing millions of rows.
n <- 5000000
df <- data.frame(CRP = sample(c("10", "2", "3", NA, "<4", ">5"), n, TRUE),
CRP2 = sample(c("10", "12", "<5", "NA", ">5", "5"), n, TRUE))
imputed <- paste0(names(df), "_imputed")
system.time(
df[imputed] <- sapply(names(df), \(var) {
x <- df[,var]
num <- suppressWarnings(as.numeric(x))
for(s in unique(grep('<|>', x, value=TRUE))) {
x[which(x==s)] <- eval(parse(text=paste0("median(num[num", s, "], na.rm=TRUE)")))
}
suppressWarnings(as.numeric(x))
})
)
user system elapsed
3.39 0.08 5.25