Search code examples
rdplyrreadr

What's best practice when handling "less than" and "greater than" symbols when importing otherwise numeric values?


This is a general question about best practice.

I'm using tidyverse packages to import, read, and then manipulate clinical lab data from a CSV. With clinical data, it's typical to report values outside the limits of quantification with "less than" or "greater than" symbols, eg <250 mg/dL or >2500 mg/dL. So you'll often have a column with mostly numeric values, but a few values that readr will interpret as strings. If I force that column to be numeric, by default the strings are coerced to NAs.

I'd like to be able to tell readr or dplyr to accept those string values (eg <250, >2500) as numeric values, dropping the characters (eg 250, 2500). Ideally this would be done in a pipeline.

In general, what's the most flexible and consistent way to handle this situation?


Solution

  • Trusting one of the packages to know what "<250 mg/dL" means is out of their realm of expertise. For instance, depending on your needs, that value might mean "effectively 0" or "effectively 250" or something else. Whether this applies to drugs or antibodies or chemicals or whatever, it is absolutely contextual.

    Given that, I think the onus is on the programmer/analyst to determine what is appropriate.

    Base R method for doing that, assuming that 2:3 reflects the columns you need to fix:

    dat <- read.csv(text = csv, stringsAsFactors = FALSE)
    str(dat)
    # 'data.frame': 2 obs. of  3 variables:
    #  $ id  : int  1 2
    #  $ val1: chr  "<250 mg/dL" ">250 mg/dL"
    #  $ val2: chr  ">2500 mg/dL" "<2500 mg/dL"
    
    dat[,2:3] <- lapply(dat[,2:3], function(s) as.numeric(gsub("[^-.0-9]", "", s)))
    str(dat)
    # 'data.frame': 2 obs. of  3 variables:
    #  $ id  : int  1 2
    #  $ val1: num  250 250
    #  $ val2: num  2500 2500
    

    A tidyversion:

    library(dplyr)
    readr::read_csv(csv) %>%
      mutate_at(vars(val1, val2), ~ as.numeric(stringr::str_replace_all(., "[^-.0-9]", "")))
    

    data.table is easy, too, adapting the lapply above.


    However, this just assumes that <250 is the same as 250, which does not differentiate "real" values from "less than" values. Consider:

    csv <-'
    id,val1,val2
    1,"<250 mg/dL",">2500 mg/dL"
    2,">250 mg/dL","<2500 mg/dL"
    3,25,2500'
    

    How should row 3 be distinguished from the others? For this, I think you'd like need to incorporate more logic, perhaps ifelse(grepl("<", s), "0", s), etc. Again, this is all contextual, so only the analyst knows what rules should go into interpreting those numbers.