Search code examples
rdplyrmin

Finding the minimum value across columns by group and when a condition is met


I am trying to find the minimum value across two columns for each person in my data. I only want to consider values in one column for the minimum calculation IF a condition (in another column) is met. For example, my data look like this:

df <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3), 
age = c(4, 5, 7, 11, 4, 5, 20, 25, 26, 6, 7, 8, 9), 
A = c(0.2, 0.3, 0.5, 0.8, 0.2, 0.3, 1, 2, 3, 0.1, 0.3, 2, 3), 
B = c(0.1, 0.1, 0.1, 0.1, 0.3, 0.3, 0.3, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4), 
C = c(0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.4, 0.4, 0.4))

df
#  id   age     A     B     C
#   1     4   0.2   0.1   0.1
#   1     5   0.3   0.1   0.1
#   1     7   0.5   0.1   0.1
#   1    11   0.8   0.1   0.1
#   2     4   0.2   0.3   0.2
#   2     5   0.3   0.3   0.2
#   2    20   1.0   0.3   0.2
#   2    25   2.0   0.3   0.2
#   2    26   3.0   0.3   0.2
#   3     6   0.1   0.4   0.4
#   3     7   0.3   0.4   0.4
#   3     8   2.0   0.4   0.4
#   3     9   3.0   0.4   0.4

I want to find the minimum value of columns A and B for each ID. Column B contains the same value repeated whereas column A contains different values. I only want to consider values in column A IF age <= 5. My desired output is in column C.

You'll notice that for person 1, the value in column B is lowest and is returned. For person 2, the value in column A when age = 4 is lowest and is returned. For person 3, the values for column A when age = 6 and age = 7 are lower than the value in column B, BUT because age > 5, I am not considering those values and therefore the value in column B is returned.

I am trying to find a solution with dplyr but am open to anything. I have tried doing this in several steps with group_by and mutate but cannot seem to get anything to work. Many thanks in advance.


Solution

  • library(dplyr)
    
    df |>
      mutate(C = min(A[age <= 5], B), .by = id)
    

    How it works

    Hopefully this is clear how it works when age <= 5 returns TRUE. Consider id when it is 3:

    age <- c(6, 7, 8, 9)
    A <- c(0.1, 0.3, 2.0, 3.0)
    B <- c(0.4, 0.4, 0.4, 0.4)
    
    A[age <= 5]
    numeric(0)
    

    Alone, sub-setting A by this condition returns an empty set. When you combine it with B though you get:

    c(A[age <= 5], B) # only B values
    [1] 0.4 0.4 0.4 0.4
    

    Essentially, when there is an age less than 5, A and B values are compared, but when this condition is not met, only B values are passed to min.

    Output

       id age   A   B   C
    1   1   4 0.2 0.1 0.1
    2   1   5 0.3 0.1 0.1
    3   1   7 0.5 0.1 0.1
    4   1  11 0.8 0.1 0.1
    5   2   4 0.2 0.3 0.2
    6   2   5 0.3 0.3 0.2
    7   2  20 1.0 0.3 0.2
    8   2  25 2.0 0.3 0.2
    9   2  26 3.0 0.3 0.2
    10  3   6 0.1 0.4 0.4
    11  3   7 0.3 0.4 0.4
    12  3   8 2.0 0.4 0.4
    13  3   9 3.0 0.4 0.4