Can anyone think how to how to do this in R?
Simple data:
seq<-c("A","A","A","B","B","B","B")
rank<-c(1,2,3,1,2,3,4)
match<-c("y","n","y","n","n","y","y")
df<- as.data.frame(cbind(seq,rank,match))
seq rank match
1 A 1 y
2 A 2 n
3 A 3 y
4 B 1 n
5 B 2 n
6 B 3 y
7 B 4 y
I want to create a ‘choose’
column whereby, for every unique seq, the first instance of y
under match
is given a T
, and all the rest are given a F
.
The desired output would be:
seq rank match choose
1 A 1 y T
2 A 2 n F
3 A 3 y F
4 B 1 n F
5 B 2 n F
6 B 3 y T
7 B 4 y F
It’s easy enough to return the correct value for rank 1 and 2 by combining an ifelse
statement with lag, however I'm stumped once rank >2
.
The real dataset
contains+100k rows
, and rank could go up to the hundreds, so I don’t want to just extend an ifelse
statement to check values above using lag.
My ultimate aim is to create a new dataset
from all of the T's under "choose", so if someone know how to directly pull them out without creating a new column that would be even better!
I’m guessing an ifelse
statement is a daft way of doing this, but I’m stuck :/
Any help would be much appreciated : )
One dplyr
possibility could be:
df %>%
group_by(seq) %>%
mutate(choose = +(match == "y") * (cumsum(match == "y") == 1))
seq rank match choose
<fct> <fct> <fct> <int>
1 A 1 y 1
2 A 2 n 0
3 A 3 y 0
4 B 1 n 0
5 B 2 n 0
6 B 3 y 1
7 B 4 y 0
And if you want to have TRUE/FALSE values:
df %>%
group_by(seq) %>%
mutate(choose = as.logical(+(match == "y") * (cumsum(match == "y") == 1)))
seq rank match choose
<fct> <fct> <fct> <lgl>
1 A 1 y TRUE
2 A 2 n FALSE
3 A 3 y FALSE
4 B 1 n FALSE
5 B 2 n FALSE
6 B 3 y TRUE
7 B 4 y FALSE
And the same with base R
:
with(df, ave(match, seq, FUN = function(x) +(x == "y") * (cumsum(x == "y") == 1)))
Or:
with(df, ave(match, seq, FUN = function(x) as.logical(+(x == "y") * (cumsum(x == "y") == 1))))