Search code examples
rdplyrnse

dplyr mutate using conditional column and specific rows


I have a data.frame with two score columns. I want to conditionally use data from one of them on a per-row basis. I explain with an example below...

> dff <- data.frame(dataset = c('Main','Main','b','b','c','c','d','d'), 
+                  score1 = c(0.01,0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08), 
+                  score2 = c(0.001, 0.2, 0.003, 0.4, 0.005, 0.6, 0.007, 0.8),
+                  name = c('A','B','A','B','A','B','A','B')); 
> dff
  dataset score1 score2 name
1    Main   0.01  0.001    A
2    Main   0.02  0.200    B
3       b   0.03  0.003    A
4       b   0.04  0.400    B
5       c   0.05  0.005    A
6       c   0.06  0.600    B
7       d   0.07  0.007    A
8       d   0.08  0.800    B

I am trying to select all the values from one score for all rows for which name == 'A', and similar for name == 'B'. Which score I choose depends on which score is lesser when dataset == 'Main'.

So for example, in this example, when name == 'A', score2 is lower than score1 for the Main dataset. Thus, for all the rows where name == 'A', I will use their value from score2.

When name == 'B', score1 is lower than score2 for the Main dataset. Thus, for all the rows where name == 'B', I will use their value from score1. The final result would look like this:

  dataset score1 score2 name final
1    Main   0.01  0.001    A 0.001
2    Main   0.02  0.200    B 0.020
3       b   0.03  0.003    A 0.003
4       b   0.04  0.400    B 0.040
5       c   0.05  0.005    A 0.005
6       c   0.06  0.600    B 0.060
7       d   0.07  0.007    A 0.007
8       d   0.08  0.800    B 0.080

So what I am trying to do is conditionally mutate and create the new final column depending on both what the name is for that row, and which column was lesser for the Main dataset with that same name. I'm trying to think of something elegant to accomplish this and not a bunch of weird lines of code, but I haven't been able to accomplish that yet.

EDIT: I'm including a sample of my actual data here.

structure(list(datasets = c("main", "main", "bms", "bms", "sny", 
"sny", "chen", "chen", "van", "van"), test_high = c(0.639654382299527, 
0.561881930194033, NA, NA, 0.909598942079794, 0.651429614317738, 
0.189274551669056, 0.541845226349475, 0.41969855766237, 0.555858598773613
), test_low = c(0.402779917451124, 0.469868712458501, NA, NA, 
0.106383376175001, 0.381060050671353, 0.824427629626441, 0.468590829264603, 
0.594646024750062, 0.460036802365713), cell = c("high", "low", 
"low", "high", "high", "low", "high", "low", "low", "high")), .Names = c("datasets", 
"test_high", "test_low", "cell"), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -10L))


   # A tibble: 10 x 4
   datasets test_high  test_low  cell
      <chr>     <dbl>     <dbl> <chr>
 1     main 0.6396544 0.4027799  high
 2     main 0.5618819 0.4698687   low
 3      bms        NA        NA   low
 4      bms        NA        NA  high
 5      sny 0.9095989 0.1063834  high
 6      sny 0.6514296 0.3810601   low
 7     chen 0.1892746 0.8244276  high
 8     chen 0.5418452 0.4685908   low
 9      van 0.4196986 0.5946460   low
10      van 0.5558586 0.4600368  high

Final in this case would end up being the same as test_low because for both Mains (ie when cell is 'high', and cell is 'low') the test_low column is smaller than the test_high column.


Solution

  • One option is case_when

    library(dplyr)
    dff %>%
        mutate(final =  case_when(name == "A" & dataset == "Main" ~ score2,
                           name == "B" & dataset=="Main" ~score1, 
                           TRUE ~ pmin(score1, score2)))
    #  dataset score1 score2 name final
    #1    Main   0.01  0.001    A 0.001
    #2    Main   0.02  0.200    B 0.020
    #3       b   0.03  0.003    A 0.003
    #4       b   0.04  0.400    B 0.040
    #5       c   0.05  0.005    A 0.005
    #6       c   0.06  0.600    B 0.060
    #7       d   0.07  0.007    A 0.007
    #8       d   0.08  0.800    B 0.080
    

    Based on the edited dataset ('dfn'),

    dfn %>%
         filter(datasets == "main") %>% 
         gather(test, val, test_high:test_low) %>% 
         group_by(cell) %>% 
         summarise(test = test[which.max(val)]) %>% 
         left_join(dfn, .) %>%
         rowwise() %>% 
         mutate(final = get(test)) %>%
         select(-test)