
How do I sample n values at random nearest to value y when the data aren't continuous?


I have a dataset that includes a list of species, their counts, and the day count from when the survey began. Since many days were not sampled, the day values are not continuous. For example, birds might have been counted on days 5, 6, 9, 10, 15, 34, 39, and so on. I set the earliest date to be day 0.

Example data:

species     counts      day
Blue tit    234         0
Blue tit    24          5
Blue tit    45          6
Blue tit    32          9
Blue tit    6           10
Blue tit    98          15
Blue tit    40          34
Blue tit    57          39
Blue tit    81          43
..................

I need to bootstrap this data and get a resulting dataset where I specify the start day, the interval to step by, and the number of points to sample at each step.

Example: Let's say I randomly pick day 5 as the start day, 30 as the interval, and 2 as the number of rows to sample. That means I start at 5, add 30 to it, and look for the 2 rows nearest to day 35 (but not day 35 itself). In this case I would grab the two rows where day is 34 and 39.

Next I add 30 to 35 and look for two points around 65. Rinse, repeat till I get to the end of the dataset.

I've written this function to do the sampling but it has flaws (see below):

resample <- function(x, ...) x[sample.int(length(x), ...)]

# l is the interval, n is # points to sample.
# This is called by another function that specifies start time among other info.
locate_points <- function(dataz, l, n)
{
    tlength <- 0
    i <- 1
    # Widen the window by one day on each side until it holds at least n days
    while (tlength < n)
    {
        low <- l - i
        high <- l + i
        if (low <= min(dataz$day)) { low <- min(dataz$day) }
        if (high >= max(dataz$day)) { high <- max(dataz$day) }
        test <- resample(dataz$day[dataz$day > low & dataz$day < high & dataz$day != l])
        tlength <- length(test)
        i <- i + 1
    }
    test <- sort(test)
    k <- test[1:n]
    return(k)
}

Two issues I need help with:

  1. While my function does return the desired number of points, they are not centered around my search value. That makes sense: as the window gets wider I pick up more points, and when I sort those and pick the first n, they tend to be the lowest values.

  2. How do I get the actual rows out? For now I have another function that locates these rows using which and then rbinds them together. It seems like there should be a better way.

thanks!


Solution

  • How about something like the following:

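    day = 1:1000

    # Search values: start at day 5 and step by 30
    search = seq(from=5, to=max(day), by=30)
    # Candidate days, excluding the search values themselves
    x = sort(setdiff(day, search))
    # findInterval() gives the index of the last candidate below each search value;
    # seq() extends that index to cover the next candidate as well
    pos = match(x[unlist(lapply(findInterval(search, x), seq, length.out=2))], day)

    day[pos]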
    

    To get the rows from your data.frame just subset it:

    rows = data[pos, ]
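
To see how this behaves on non-continuous days, here is a minimal sketch using the sample rows from the question. The data frame name `birds` is only illustrative, and the sketch assumes a single species so that each day appears once (`match` returns only the first matching row).

```r
# Sample rows from the question; `birds` is an illustrative name.
birds <- data.frame(
  species = "Blue tit",
  counts  = c(234, 24, 45, 32, 6, 98, 40, 57, 81),
  day     = c(0, 5, 6, 9, 10, 15, 34, 39, 43)
)

# Search values: start at day 5 and step by 30 (here: 5 and 35).
search <- seq(from = 5, to = max(birds$day), by = 30)

# Candidate days, excluding the search values themselves.
x <- sort(setdiff(birds$day, search))

# For each search value: index of the last candidate below it, plus the next one.
idx <- outer(c(0, 1), findInterval(search, x), `+`)

# Map the chosen days back to row positions and subset the data frame.
pos  <- match(x[idx], birds$day)
rows <- birds[pos, ]
rows$day
# [1]  0  6 34 39
```

Around day 35 this picks days 34 and 39, as in the question's example. Because `search` starts at the start day itself, it also picks the neighbours of day 5; using `from = 5 + 30` would skip that first window if it is not wanted.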
    

    This is maybe slightly cleaner than the unlist/lapply/seq combo:

    pos = match(x[outer(c(0, 1), findInterval(search, x), `+`)], day)
    

    Also note that if you want a larger window (e.g. 4 points), it's just a matter of starting a bit further back:

    pos = match(x[outer(-1:2, findInterval(search, x), `+`)], day)
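
The `findInterval` approach always takes the same number of points on each side of the search value. If the goal is instead the n days closest to the search value regardless of side, one possible alternative is to rank rows by absolute distance. This is only a sketch and not part of the answer above: `nearest_rows` is a hypothetical helper, `birds` is an illustrative name for the question's sample data, and ties in distance are broken at random.

```r
# Sample rows from the question; `birds` is an illustrative name.
birds <- data.frame(
  species = "Blue tit",
  counts  = c(234, 24, 45, 32, 6, 98, 40, 57, 81),
  day     = c(0, 5, 6, 9, 10, 15, 34, 39, 43)
)

# Hypothetical helper: return the n rows whose `day` is closest to `target`,
# excluding the target day itself.
nearest_rows <- function(df, target, n) {
  candidates <- df[df$day != target, ]
  dist <- abs(candidates$day - target)
  # Shuffle first so that order(), which is stable, breaks ties at random.
  shuffle <- sample.int(nrow(candidates))
  keep <- shuffle[order(dist[shuffle])]
  # Keep the n closest rows, restored to their original order.
  candidates[sort(head(keep, n)), ]
}

# Start at day 5, then search around 5 + 30, 5 + 60, ... up to the last day.
centres <- seq(from = 5 + 30, to = max(birds$day), by = 30)
result  <- do.call(rbind, lapply(centres, nearest_rows, df = birds, n = 2))
result$day
# [1] 34 39
```

Since the helper returns whole rows, there is no separate lookup step: `do.call(rbind, ...)` stacks the rows found for each search value into one data frame.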