I have a data.frame with two score columns. I want to conditionally use data from one of them on a per-row basis. I explain with an example below...
> dff <- data.frame(dataset = c('Main','Main','b','b','c','c','d','d'),
+ score1 = c(0.01,0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08),
+ score2 = c(0.001, 0.2, 0.003, 0.4, 0.005, 0.6, 0.007, 0.8),
+ name = c('A','B','A','B','A','B','A','B'));
> dff
dataset score1 score2 name
1 Main 0.01 0.001 A
2 Main 0.02 0.200 B
3 b 0.03 0.003 A
4 b 0.04 0.400 B
5 c 0.05 0.005 A
6 c 0.06 0.600 B
7 d 0.07 0.007 A
8 d 0.08 0.800 B
I am trying to select all the values from one score for all rows for which name == 'A'
, and similar for name == 'B'
. Which score I choose depends on which score is lesser when dataset == 'Main'
.
So for example, in this example, when name == 'A'
, score2
is lower than score1
for the Main
dataset. Thus, for all the rows where name == 'A'
, I will use their value from score2
.
When name == 'B'
, score1
is lower than score2
for the Main
dataset. Thus, for all the rows where name == 'B'
, I will use their value from score1
. The final result would look like this:
dataset score1 score2 name final
1 Main 0.01 0.001 A 0.001
2 Main 0.02 0.200 B 0.020
3 b 0.03 0.003 A 0.003
4 b 0.04 0.400 B 0.040
5 c 0.05 0.005 A 0.005
6 c 0.06 0.600 B 0.060
7 d 0.07 0.007 A 0.007
8 d 0.08 0.800 B 0.080
So what I am trying to do is conditionally mutate and create the new final
column depending on both what the name is for that row, and which column was lesser for the Main
dataset with that same name. I'm trying to think of something elegant to accomplish this and not a bunch of weird lines of code, but I haven't been able to accomplish that yet.
EDIT: I'm including a sample of my actual data here.
structure(list(datasets = c("main", "main", "bms", "bms", "sny",
"sny", "chen", "chen", "van", "van"), test_high = c(0.639654382299527,
0.561881930194033, NA, NA, 0.909598942079794, 0.651429614317738,
0.189274551669056, 0.541845226349475, 0.41969855766237, 0.555858598773613
), test_low = c(0.402779917451124, 0.469868712458501, NA, NA,
0.106383376175001, 0.381060050671353, 0.824427629626441, 0.468590829264603,
0.594646024750062, 0.460036802365713), cell = c("high", "low",
"low", "high", "high", "low", "high", "low", "low", "high")), .Names = c("datasets",
"test_high", "test_low", "cell"), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L))
# A tibble: 10 x 4
datasets test_high test_low cell
<chr> <dbl> <dbl> <chr>
1 main 0.6396544 0.4027799 high
2 main 0.5618819 0.4698687 low
3 bms NA NA low
4 bms NA NA high
5 sny 0.9095989 0.1063834 high
6 sny 0.6514296 0.3810601 low
7 chen 0.1892746 0.8244276 high
8 chen 0.5418452 0.4685908 low
9 van 0.4196986 0.5946460 low
10 van 0.5558586 0.4600368 high
Final in this case would end up being the same as test_low because for both Mains (ie when cell is 'high', and cell is 'low') the test_low column is smaller than the test_high column.
One option is case_when
library(dplyr)
dff %>%
mutate(final = case_when(name == "A" & dataset == "Main" ~ score2,
name == "B" & dataset=="Main" ~score1,
TRUE ~ pmin(score1, score2)))
# dataset score1 score2 name final
#1 Main 0.01 0.001 A 0.001
#2 Main 0.02 0.200 B 0.020
#3 b 0.03 0.003 A 0.003
#4 b 0.04 0.400 B 0.040
#5 c 0.05 0.005 A 0.005
#6 c 0.06 0.600 B 0.060
#7 d 0.07 0.007 A 0.007
#8 d 0.08 0.800 B 0.080
Based on the edited dataset ('dfn'),
dfn %>%
filter(datasets == "main") %>%
gather(test, val, test_high:test_low) %>%
group_by(cell) %>%
summarise(test = test[which.max(val)]) %>%
left_join(dfn, .) %>%
rowwise() %>%
mutate(final = get(test)) %>%
select(-test)