Search code examples
rif-statementdplyrtidyverselogical-operators

Tidyverse: case_when() not returning correct value


For a set of columns, I wish to detect which column has the max value. If the value between columns is equal, then I would like to use a weight to decide which column to select. I've tried implementing this using case_when() however it does not work. I will show this problem below using an example dataset...

Let's say I have a dataset which includes these three columns (A, B, C) that correspond to the amount of Apple, Bananas or Carrots a child eats per day.

For row I want to log the most consumed food (i.e., column with the highest value). If the value between any column is equal (e.g., 1 apple and 1 banana), then apply the following rank. Apple > Banana > Carrot, whereby if a child eats 1 Apple and 1 Banana, the log will show Apple.

I've tried implementing this in R using by using pairs of if_else statements with case_when(). However, it does not return the correct result. For example, the final row should be classified as Apple, not carrot. I'm not sure what I'm doing wrong. Please provide a Tidyverse solution to this issue and if possible, explain why my approach did not work.

library(tidyverse)

A <- c(1,1,3,3)
B <- c(2,3,1,1)
C <- c(1,1,1,2)
df <- data.frame(A,B,C)

top_food <- df %>% 
  mutate(highest = case_when(
    C > B ~ "carrot", # if carrot > banana
    C > A ~ "carrot", # if carrot > apple 
    B > A ~ "banana", # if banana > apple 
    B >= A ~ "banana", # if banana >= carrot
    A >= B ~ "apple", # if apple  >= banana
    A >= C ~ "apple" # if apple >= carrot
  )) 

> | A | B | C | HIGHEST |  |
> | 1 | 2 | 1 | banana  |  |
> | 1 | 3 | 1 | banana  |  |
> | 3 | 1 | 1 | apple   |  |
>   3 | 1 | 2 | carrot  |  |

Notes: - this is an example dataset. - I'm open to solutions with different functions but please provide Tidyverse answers as this is how I am learning R. If possible, please explain why my approach using case_when() did not work so I can learn. - It's important to maintain the shape/layout of the dataset so it can be used in software outside of R, so please do not convert to long format.


Solution

  • A quick fix could contain dplyr::rowise and which.max(). The downside to this approach is that it is quite slow. Additionally I am assuming that the order of your columns reflects the rank of fruits (if there is a tie, which.max will return the first value).

    A <- c(1,1,3,3,1,1)
    B <- c(2,3,1,1,1,2)
    C <- c(1,1,1,2,1,2)
    
    df <- data.frame(A,B,C)
    labels <- c("apple","banana","carrot")
    
    df %>%
        dplyr::rowwise() %>%
        dplyr::mutate(top=labels[which.max(c(A,B,C))]) %>%
        dplyr::ungroup()
    

    Another (and probably better) approach could be using max.col()

    df <- data.frame(A,B,C)
    labels <- c("apple","banana","carrot")
    df %>%
        dplyr::mutate(top=labels[max.col(df,ties="first")])
    

    And yet another solution independent of your column order (now I am out :D):

    df <- data.frame(A,B,C)
    
    top_food <- df %>% 
        dplyr::mutate(highest = dplyr::case_when(
            C > A & C > B ~ "carrot",
            B > A & B >= C ~ "banana",
            TRUE ~ "apple"))
    

    your approach with case_when did not work, because case_when stops checking conditions as soon as a true condition is found. As your first condition is C>B ~ "carriot" which is true for your last row (2>1), case_when returned "carrot" and did not check the other conditions.