My problem sits inside a loop. I have a large dataset (DF), a subset of which looks like this:
ID Site Species
101 4 x
101 4 y
101 4 z
102 6 x
102 6 z
102 6 a
102 6 b
103 6 a
103 6 z
103 6 c
103 6 x
103 6 y
105 6 x
105 6 y
105 6 a
105 6 z
108 1 x
108 1 a
108 1 c
108 1 z
I would like to randomly select, on each iteration of my loop (so, i), all rows of an individual ID from each Site. Crucially, I want only one ID from each Site. I have a separate function that subsets my large dataset by the number of Sites, so if i=1 then only one of the above Sites (for example) would be present in the subset.
If i=3, as in this posted example, then I would want all rows of 101, all rows of either 102, 103 or 105, and all rows of 108.
I think something like ddply() with sample() should do it, but I cannot get it to happen randomly.
Any suggestions would be greatly appreciated. Thanks,
James
How about this? I've added a function to simulate what I think your data looks like.
#dependencies
library(plyr)  # provides ldply(), ddply(), summarise() and .()
#function to make data (just to work with)
make_data<-function(id){
set.seed(id)
num_sites<-round(runif(1)*3,0)+1
num_sp<-round(runif(1)*7,0)+1
sites<-sample(1:10,num_sites,FALSE)
ldply(sites,function(x)data.frame(sites=x,sp=sample(letters[1:26],num_sp,FALSE)))
}
#make a data frame for example use (as per question)
ids<-100:200
df<-ldply(ids,function(x)data.frame(id=x,make_data(x)))
################################################
# HERE'S THE CODE FOR THE ANSWER #
# use ddply to summarise by site & sampled ids #
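# index into the unique ids: sample(id,1) would draw from 1:id when a site has a single row,
# and would favour ids that have more rows
filter<-ddply(df,.(sites),summarise,set=unique(id)[sample.int(length(unique(id)),1)])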
# then apply this filter to the original list
ddply(filter,.(sites),.fun=function(x){return(df[df$sites==x$sites & df$id==x$set,])})
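For comparison, here is a minimal base R sketch of the same idea, run on the example rows from the question rather than on simulated data. The helper name pick_one_id_per_site is made up for illustration, and it assumes the columns are called ID and Site as posted. It picks one ID per Site by indexing into the unique IDs, then keeps every row belonging to the chosen ID at that Site.

```r
# Example rows copied from the question
DF <- data.frame(
  ID      = c(rep(101, 3), rep(102, 4), rep(103, 5), rep(105, 4), rep(108, 4)),
  Site    = c(rep(4, 3), rep(6, 13), rep(1, 4)),
  Species = c("x", "y", "z",
              "x", "z", "a", "b",
              "a", "z", "c", "x", "y",
              "x", "y", "a", "z",
              "x", "a", "c", "z")
)

# Hypothetical helper (not part of any package): one random ID per Site
pick_one_id_per_site <- function(d) {
  # List of IDs per Site, named by the Site value
  ids_by_site <- split(d$ID, d$Site)
  # Choose one ID per Site by position, so a Site with a single ID is handled
  # correctly and IDs with more rows are not favoured
  chosen <- vapply(ids_by_site, function(ids) {
    u <- unique(ids)
    u[sample.int(length(u), 1)]
  }, numeric(1))
  # Keep the rows whose ID is the one chosen for their own Site
  d[d$ID == chosen[as.character(d$Site)], , drop = FALSE]
}

set.seed(1)  # only to make the example repeatable
res <- pick_one_id_per_site(DF)
res
```

Calling pick_one_id_per_site() inside the loop, on whatever subset of Sites iteration i produces, should give a fresh draw each time as long as set.seed() is not reset inside the loop.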