Search code examples
rdplyrquantilepercentile

How to select observations that are within a certain quantile


I have data (~1000 rows) that look like this:

head(data)

   alt   alb   alp alt_zscore alb_zscore alp_zscore
 <dbl> <dbl> <dbl>      <dbl>      <dbl>      <dbl>
1  11    2.60   9       -1.54      -7.82      -0.949
2  12    5.37  86.3     -1.45      -0.351      2.31 
3  15.7  4.67  28       -1.09      -2.24      -0.148
4   7    4.43 171.      -1.93      -2.89       5.87 
5  14.5  3.75  12       -1.20      -4.72      -0.822
6  17.5  3.70  82.5     -0.915     -4.86       2.15

Each variable column (i.e., alt, alb, alp) has a corresponding z-score column (alt_zscore, alb_zscore, alp_zscore).

In my previous code, I told R, for each z-scored column, if an observation is further than 1 standard deviation lower than the mean, to take the absolute value of the z-scored observation; if it does not meet the criteria, give it a 0. (The reason I do this is because later on in my code, I add up all those z-scored observations into another column.)

Here is my previous code:

name <- c("alt_zscore", "alb_zscore", "alp_zscore")

stdev <- 1

lf <- list(
  \(x) ifelse(x <= -stdev, abs(x), 0),
  \(x) ifelse(x <= -stdev, abs(x), 0),
  \(x) ifelse(x <= -stdev, abs(x), 0)
) %>% 
  setNames(name)

The reason I do this is so I can create a new column "total_score" that is the sum of all the z-scores that fit my criteria.

data <- data %>% 
  mutate(total_score = rowSums(across(all_of(name), ~ lf[[cur_column()]](.)), na.rm = TRUE))

Now, what I am trying to do is tell R, for each 'regular' column (i.e., here I'm referring to 'alt', not 'alt_zscore'), if an observation is less than the 25th percentile of that column, then take the absolute value of its corresponding z-scored column (alt_zscore); otherwise, give it a zero. Note: sometimes I will need to specify the 75th percentile, or the 25th OR the 75th, so I’m hoping for the code to be adjustable in that regard.

I'm trying to modify my existing code to perform this but haven't had luck. Any help would be greatly appreciated. Thank you!


Solution

  • You can use across() with cur_column() to search the corresponding z-score column.

    library(dplyr)
    
    df %>%
      mutate(across(alt:alp,
                    ~ if_else(.x < quantile(.x, .25), abs(get(paste0(cur_column(), "_zscore"))), 0),
                    .names = "{.col}_new"))
    
    #    alt  alb   alp alt_zscore alb_zscore alp_zscore alt_new alb_new alp_new
    # 1 11.0 2.60   9.0     -1.540     -7.820     -0.949    1.54    7.82   0.949
    # 2 12.0 5.37  86.3     -1.450     -0.351      2.310    0.00    0.00   0.000
    # 3 15.7 4.67  28.0     -1.090     -2.240     -0.148    0.00    0.00   0.000
    # 4  7.0 4.43 171.0     -1.930     -2.890      5.870    1.93    0.00   0.000
    # 5 14.5 3.75  12.0     -1.200     -4.720     -0.822    0.00    0.00   0.822
    # 6 17.5 3.70  82.5     -0.915     -4.860      2.150    0.00    4.86   0.000
    

    For your first task, the list of functions (lf) is not needed because all functions are the same.

    df %>% 
      mutate(total_score = rowSums(
        across(alt_zscore:alp_zscore, ~ ifelse(.x <= -1, abs(.x), 0)), na.rm = TRUE
      ))
    

    Data
    df <- read.table(text =
    "   alt   alb   alp alt_zscore alb_zscore alp_zscore
    1  11    2.60   9       -1.54      -7.82      -0.949
    2  12    5.37  86.3     -1.45      -0.351      2.31 
    3  15.7  4.67  28       -1.09      -2.24      -0.148
    4   7    4.43 171.      -1.93      -2.89       5.87 
    5  14.5  3.75  12       -1.20      -4.72      -0.822
    6  17.5  3.70  82.5     -0.915     -4.86       2.15")