
R sample until a condition is met


So I have the following data frame:

structure(list(V1 = c(45L, 17L, 28L, 26L, 18L, 41L, 26L, 20L, 
23L, 31L, 48L, 23L, 32L, 18L, 30L, 11L, 26L)), .Names = "V1", row.names = c("24410", 
"26526", "26527", "43264", "63594", "125630", "148318", "245516", 
"269500", "293171", "301217", "400294", "401765", "520084", "545501", 
"564914", "742654"), class = "data.frame")

The row names represent parcels and V1 shows the number of examples per parcel I can draw from. What I want is to take a sample from every parcel proportional to the number of examples available, ending up with a total of 400 examples across all parcels. The idea is not to oversample one parcel with respect to the others.
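To make "proportional to the number of examples available" concrete, here is a minimal sketch of how the 400 draws could be split across the parcels above. The helper name `proportional_quota` and the largest-remainder rounding are assumptions made for illustration, not something from the original code; plain `round()` would not always sum to exactly 400.

```r
# Examples available per parcel (V1 and the row names from the data frame above)
available <- c(45, 17, 28, 26, 18, 41, 26, 20, 23, 31, 48, 23, 32, 18, 30, 11, 26)
names(available) <- c("24410", "26526", "26527", "43264", "63594", "125630",
                      "148318", "245516", "269500", "293171", "301217", "400294",
                      "401765", "520084", "545501", "564914", "742654")

# Hypothetical helper: split n_total across groups in proportion to their size,
# using largest-remainder rounding so the quotas add up to exactly n_total.
proportional_quota <- function(available, n_total) {
  stopifnot(n_total <= sum(available))
  exact <- available / sum(available) * n_total  # fractional share per group
  quota <- floor(exact)                          # round every share down first
  shortfall <- n_total - sum(quota)              # draws still to hand out
  if (shortfall > 0) {
    # Give one extra draw to the groups with the largest fractional parts
    top <- order(exact - quota, decreasing = TRUE)[seq_len(shortfall)]
    quota[top] <- quota[top] + 1
  }
  quota
}

quota <- proportional_quota(available, 400)
```

Each entry of `quota` is then the number of examples to draw from the parcel with that name.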

The dataset from which the sampling is going on is here.

So far the code looks like this:

df <- read.csv('/data/samplefrom.csv')
df.training <- data.frame()
n <- 400  # examples to draw per crop + BBCH stage combination

for (crop in sort(unique(df$code_surveyed))) {
  for (bbch_stage in sort(unique(df$bbch))) {
    # Rows for this crop + BBCH stage, dropping rows without a name
    df.int <- df[df$bbch == bbch_stage & df$code_surveyed == crop, ]
    df.int <- df.int[!is.na(df.int$name), ]
    # Only sample combinations that have at least n examples
    if (nrow(df.int) >= n) {
      df.int.sampled <- df.int[sample(nrow(df.int), n), ]
      df.training <- rbind(df.training, df.int.sampled)
    }
  }
}

This samples 400 examples at random for each crop + bbch_stage combination (think of this as a composite variable). That's all fine and dandy, but I want to be able to control which parcel (variable objectid) the examples come from. In essence, I need an extra filtering step while sampling.

I've made a few attempts with while and repeat statements, and also with the stratified function from devtools, but none of them seems to produce what I'm after.


Solution

  • Well after some ups and downs I got to this point:

    # Assumes name and objectid have no missing values
    df.training <- data.frame()
    for (crop in unique(df$code)) {
      df.crop <- df[df$code == crop, ]
      # Cap the target so the loop also ends for crops with fewer than 400 distinct examples
      target <- min(400, length(unique(df.crop$name)))
      df.crop.sampled <- data.frame()
      while (nrow(df.crop.sampled) < target) {
        # One pass: try to draw one new example from every parcel in turn
        for (parcel in unique(df.crop$objectid)) {
          # Stop mid-pass once the target is reached, so we never overshoot it
          if (nrow(df.crop.sampled) >= target) break
          df.parcel <- df.crop[df.crop$objectid == parcel, ]
          df.parcel <- df.parcel[sample(nrow(df.parcel), 1), ]
          # Keep the draw only if this example has not been picked already
          if (!df.parcel$name %in% df.crop.sampled$name) {
            df.crop.sampled <- rbind(df.crop.sampled, df.parcel)
          }
        }
      }
      df.training <- rbind(df.training, df.crop.sampled)
    }
    

    While certainly not the most elegant, it does the job. I would still very much appreciate it if someone could point me to a stratified sampling function that does this in an easier way.
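For what it's worth, a loop-free variant in base R might look like the sketch below. It is not the approach from the answer above: instead of cycling through parcels one draw at a time, it fixes a quota per parcel up front (proportional to parcel size, with largest-remainder rounding) and samples without replacement inside each parcel. The helper `sample_proportional` and the toy data frame are made up for illustration; only the column name `objectid` comes from the question.

```r
# Hypothetical helper: draw n_total rows from `data`, spread over the groups in
# `group_col` in proportion to group size, without replacement.
sample_proportional <- function(data, group_col, n_total) {
  # Row indices of each group (rows with a missing group value are ignored)
  groups <- split(seq_len(nrow(data)), data[[group_col]])
  sizes <- lengths(groups)
  stopifnot(n_total <= sum(sizes))

  # Proportional quotas, rounded so that they sum to exactly n_total
  exact <- sizes / sum(sizes) * n_total
  quota <- floor(exact)
  shortfall <- n_total - sum(quota)
  if (shortfall > 0) {
    top <- order(exact - quota, decreasing = TRUE)[seq_len(shortfall)]
    quota[top] <- quota[top] + 1
  }

  # sample.int() on the group length avoids the sample(x, k) surprise
  # when a group holds a single row index
  picked <- unlist(
    Map(function(rows, k) rows[sample.int(length(rows), k)], groups, quota),
    use.names = FALSE
  )
  data[picked, , drop = FALSE]
}

# Toy data standing in for one crop: three parcels of different sizes
set.seed(1)
toy <- data.frame(
  objectid = rep(c("a", "b", "c"), times = c(45, 17, 28)),
  name = paste0("img_", seq_len(90)),
  stringsAsFactors = FALSE
)
res <- sample_proportional(toy, "objectid", 40)
table(res$objectid)
```

To use it on the real data, one would presumably split `df` by crop and call the helper on each piece with `group_col = "objectid"`, lowering `n_total` for crops that have fewer than 400 rows, since the helper stops with an error in that case.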