Replacing string values with dplyr

I have a dataframe in R:

df <- data.frame(CRP = c("10", "2", "3", NA, "<4", ">5"))

I want to create a numeric column that replaces the strings into numeric values. Additionally, whenever it find a string starting with “<“ or “>”, it should look in the next letter, i.e. number, and impute the string with the median of the observations above or below that number. For example, “<4” should be replaced with median(c(2,3)). If there are no values below or above the threshold,NA the value.The remaining NAs should remain as it is currently now.

Desired output:

df = data.frame(c(10,2,3,NA,median(c(2,3)),median(10))

I have not managed to find a solution that avoids pre specifying functions a priori. I also want to avoid for loops. Ideally I would like to apply this into a data frame of millions of rows where multiple columns have that issue.

Solution

For multiple variables, you can try the following code, which uses a for loop only for the unique observations that contain "<" or ">" and then replaces all values that satisfy that condition.

df <- structure(list(CRP = c("10", "2", "3", NA, "<4", ">5"), CRP2 = c("10", 
"12", "<5", "NA", ">5", "5")), class = "data.frame", row.names = c(NA, 
-6L))

imputed <- paste0(names(df), "_imputed")

df[imputed] <- sapply(names(df), \(var) {
  x <- df[,var]
  num <- suppressWarnings(as.numeric(x))
  for(s in unique(grep('<|>', x, value=TRUE))) {
    x[which(x==s)] <- eval(parse(text=paste0("median(num[num", s, "], na.rm=TRUE)")))
  }
  suppressWarnings(as.numeric(x))
})
df

   CRP CRP2 CRP_imputed CRP2_imputed
1   10   10        10.0           10
2    2   12         2.0           12
3    3   <5         3.0           NA
4 <NA>   NA          NA           NA
5   <4   >5         2.5           11
6   >5    5        10.0            5

It's quick even for a dataset containing millions of rows.

n <- 5000000

df <- data.frame(CRP = sample(c("10", "2", "3", NA, "<4", ">5"), n, TRUE),
                 CRP2 = sample(c("10", "12", "<5", "NA", ">5", "5"), n, TRUE))

imputed <- paste0(names(df), "_imputed")

system.time(
  df[imputed] <- sapply(names(df), \(var) {
    x <- df[,var]
    num <- suppressWarnings(as.numeric(x))
    for(s in unique(grep('<|>', x, value=TRUE))) {
      x[which(x==s)] <- eval(parse(text=paste0("median(num[num", s, "], na.rm=TRUE)"))) 
    }
    suppressWarnings(as.numeric(x))
  })
)

   user  system elapsed 
   3.39    0.08    5.25