Search code examples
rdplyrstringr

Replacing string values with dplyr


I have a dataframe in R:

df <- data.frame(CRP = c("10", "2", "3", NA, "<4", ">5"))

I want to create a numeric column that replaces the strings into numeric values. Additionally, whenever it find a string starting with “<“ or “>”, it should look in the next letter, i.e. number, and impute the string with the median of the observations above or below that number. For example, “<4” should be replaced with median(c(2,3)). If there are no values below or above the threshold,NA the value.The remaining NAs should remain as it is currently now.

Desired output:

df = data.frame(c(10,2,3,NA,median(c(2,3)),median(10))

I have not managed to find a solution that avoids pre specifying functions a priori. I also want to avoid for loops. Ideally I would like to apply this into a data frame of millions of rows where multiple columns have that issue.


Solution

  • For multiple variables, you can try the following code, which uses a for loop only for the unique observations that contain "<" or ">" and then replaces all values that satisfy that condition.

    df <- structure(list(CRP = c("10", "2", "3", NA, "<4", ">5"), CRP2 = c("10", 
    "12", "<5", "NA", ">5", "5")), class = "data.frame", row.names = c(NA, 
    -6L))
    
    imputed <- paste0(names(df), "_imputed")
    
    df[imputed] <- sapply(names(df), \(var) {
      x <- df[,var]
      num <- suppressWarnings(as.numeric(x))
      for(s in unique(grep('<|>', x, value=TRUE))) {
        x[which(x==s)] <- eval(parse(text=paste0("median(num[num", s, "], na.rm=TRUE)")))
      }
      suppressWarnings(as.numeric(x))
    })
    df
    

       CRP CRP2 CRP_imputed CRP2_imputed
    1   10   10        10.0           10
    2    2   12         2.0           12
    3    3   <5         3.0           NA
    4 <NA>   NA          NA           NA
    5   <4   >5         2.5           11
    6   >5    5        10.0            5
    

    It's quick even for a dataset containing millions of rows.

    n <- 5000000
    
    df <- data.frame(CRP = sample(c("10", "2", "3", NA, "<4", ">5"), n, TRUE),
                     CRP2 = sample(c("10", "12", "<5", "NA", ">5", "5"), n, TRUE))
    
    imputed <- paste0(names(df), "_imputed")
    
    system.time(
      df[imputed] <- sapply(names(df), \(var) {
        x <- df[,var]
        num <- suppressWarnings(as.numeric(x))
        for(s in unique(grep('<|>', x, value=TRUE))) {
          x[which(x==s)] <- eval(parse(text=paste0("median(num[num", s, "], na.rm=TRUE)"))) 
        }
        suppressWarnings(as.numeric(x))
      })
    )
    
       user  system elapsed 
       3.39    0.08    5.25