Search code examples
rtidyversedata-wrangling

Calculating a cumulative deficit index


I have a dataset of blood analytes that look like this:

alb alp na pot alb_zscore alp_zscore na_zscore pot_zscore
5.37 86. 142 6.73 -7.82 -0.95 0 3.1
4.67 28 127 23.87 -0.35 2.31 1.39 6.3
4.43 170 NA 4.27 -2.29 -0.14 NA -1.4
3.75 12 164 3.05 -2.88 5.86 0.26 -3.5
4.85 NA 139 6.75 -4.72 NA 8.23 -3.64

I have many more analytes/columns but I chose this for simplicity sake. There are ~1500 rows, where each row is an individual. Please note the numbers in the example table are random/made-up (in case you are wondering why the corresponding z-scores seem so wonky).

For context, I'm creating an index of dysregulation for each individual. For some analytes, a value that is too high can represent dysregulation; for other analytes, a low value represents dysregulation; and for other analytes, either a high or low value represents dysregulation. The number of analytes that are in dysregulation per individual are then summed, to create something called a Cumulative Deficit Index (see Noppert et al. 2018 if curious to learn more).

For my code, I aim to split each analyte_zscore into quartiles (although analytes could be split into octiles instead- see more below) and those in the highest (or lowest) quartile of the distribution receive a score of 1. Ultimately, I want a column called 'dysreg_index' that is the sum of the number of analytes for which an individual received a score of 1, divided by the total number of analytes available for that individual. Some individuals have NAs for an analyte value - I do not want any analyte with an NA counted towards the total number of analytes for that individual.

Given that I have many analytes each with their own 'rule' of what constitutes dysregulation, I would like the code to be customizable. For the purpose of this example, pretend that these are the rules of what constitutes dysregulation...

alb_zscore: lower quartile (<25%),
alp_zscore: upper quartle (>75%),
na_zscore: lower or upper quartile (<25% or >75%),
pot_zscore: lower or upper octile (<12.5% or >87.5%)

(Note: this isn't code, but Stack thought it was and wouldn't let me post it without formatting it as such)

Any help in creating this code would be greatly appreciated. Thank you so much!

Source: Noppert GA, Aiello AE, O’Rand AM, Cohen HJ (2018). Investigating pathogen burden in relation to a cumulative deficits index in a representative sample of US adults. Epidemiology and Infection 146, 1968–1976. https://doi.org/10.1017/S095026881800153X


Solution

  • One way is to have a "dictionary" of functions to evaluate 0/1 for a particular column.

    funcs <- list(
      alb_zscore = function(z) !is.na(z) & z < quantile(z, 0.25, na.rm = TRUE),
      alp_zscore = function(z) !is.na(z) & z > quantile(z, 0.75, na.rm = TRUE),
      na_zscore = function(z) !is.na(z) & !between(z, quantile(z, 0.25, na.rm = TRUE), quantile(z, 0.75, na.rm = TRUE)),
      pot_zscore = function(z) !is.na(z) & !between(z, quantile(z, 0.125, na.rm = TRUE), quantile(z, 0.875, na.rm = TRUE))
    )
    

    With that, we can do:

    mapply(function(fn, x) fn(x), funcs, quux[names(funcs)])
    #      alb_zscore alp_zscore na_zscore pot_zscore
    # [1,]       TRUE      FALSE      TRUE      FALSE
    # [2,]      FALSE      FALSE     FALSE       TRUE
    # [3,]      FALSE      FALSE     FALSE      FALSE
    # [4,]      FALSE       TRUE     FALSE      FALSE
    # [5,]      FALSE      FALSE      TRUE       TRUE
    

    With this, we can easily use rowSums to come up with your dysreg_index. Edited to divide by the number of non-NA elements in the row:

    quux %>%
      mutate(
        dysreg_index = {
          num <- mapply(function(fn, x) fn(x), funcs, pick(all_of(names(funcs))))
          den <- (!is.na(pick(all_of(names(funcs)))))
          rowSums(num) / rowSums(den)
        }
      )
    #    alb alp  na   pot alb_zscore alp_zscore na_zscore pot_zscore dysreg_index
    # 1 5.37  86 142  6.73      -7.82      -0.95      0.00       3.10    0.5000000
    # 2 4.67  28 127 23.87      -0.35       2.31      1.39       6.30    0.2500000
    # 3 4.43 170  NA  4.27      -2.29      -0.14        NA      -1.40    0.0000000
    # 4 3.75  12 164  3.05      -2.88       5.86      0.26      -3.50    0.2500000
    # 5 4.85  NA 139  6.75      -4.72         NA      8.23      -3.64    0.6666667
    

    Data

    quux <- structure(list(alb = c(5.37, 4.67, 4.43, 3.75, 4.85), alp = c(86, 28, 170, 12, NA), na = c(142L, 127L, NA, 164L, 139L), pot = c(6.73, 23.87, 4.27, 3.05, 6.75), alb_zscore = c(-7.82, -0.35, -2.29, -2.88, -4.72), alp_zscore = c(-0.95, 2.31, -0.14, 5.86, NA), na_zscore = c(0, 1.39, NA, 0.26, 8.23), pot_zscore = c(3.1, 6.3, -1.4, -3.5, -3.64)), class = "data.frame", row.names = c(NA, -5L))