Search code examples
rsamplingstatistics-bootstrap

Resampling with replacement


I am trying to get the (bootstrapped) input data for my model.

Source file: https://www.dropbox.com/s/dudzxhozr50uhr7/EddyData_2010.csv?dl=0

library("dplyr")
library("readr")
library("reshape2")
library("ggplot2")

sub <- read_csv("EddyData_2010.csv", 
                col_types = list(col_integer(), col_integer(), col_double(),
                                 col_double(), col_double(), col_double(),
                                 col_double(), col_double(), col_double(),
                                 col_double(), col_double(), col_double()),
                col_names = c("Year", "DoY", "Hour", "NEE", "LE", "H", "Rg",
                              "Tair", "Tsoil", "rH", "Ustar", "VPD")) %>%
  filter(DoY == 170) %>%
  mutate(hour = 1:48) %>%
  select(NEE:hour)

# Number of resampling 
n_resempling <- 1000 
result_resampling <- NULL

# Do the resampling
for (i in 1:n_resempling) {
  result_resampling <- sample(48, 48, replace = T) %>%
    slice(sub, .) %>%
    mutate(resempling_number = i) %>%
    bind_rows(. , result_resampling)
}

This generates a resampling with replacement such as

enter image description here

The output shows 1000 bootstraps resampled with replacement along 48 half-hours that a day have.

Here is my problem:

The resampling with replacement is randomly mixing up half-hours along the day without any kind of control. For instance, I am not interested in mixing up half-hours at night and half-hours during the days. The results leads to weird calculations afterwards. Is there any possibility to code this in such a way I define what is allowed and what is not?

Example:

  • it is only allowed to resample from 10pm till 5pm
  • night hours cannot be resampled with day hours (and vice versa)

Solution

  • I have done naive bootstrap for CRD design but not for time data. Is that time series data? From what I understand, you want 2pm to be sampled only with 2 pm and not 3 pm. Understanding the sampling is important for doing the right analysis because it's really easy to go really wrong in R.

    I noticed that you used a loop to bootstrap instead of a package. I used 'boot' package for naive bootstrap, so I turned to Google to look at time data. Here's what I found, I apologize this is all I've got (I couldn't comment due to lack of rep) Using boot package I would bet anything is faster than using a loop you can check with system.time( ) especially if you have a lot of data.

    https://stat.ethz.ch/R-manual/R-devel/library/boot/html/tsboot.html

    Here's how I worked on my naive bootstrap:

    my.boot.fnx<-function(var, ind,alpha=0.95){
      newdf      <-na.omit(var[ind])
      mymean     <-mean(newdf,na.rm=TRUE)
      mytrimmean <-mean(newdf, trim=0.1, na.rm=TRUE)
      mymedian   <-median(newdf,na.rm=TRUE)
      mysd       <-sd(newdf,na.rm=TRUE)
      nonmiss    <- length(na.omit(newdf))
      semean     <- mysd/sqrt(nonmiss)
      lcl        <- mymean - qt(1-alpha/2,nonmiss-1)*semean
      ucl        <- mymean + qt(1-alpha/2,nonmiss-1)*semean
      mygini     <-
       sum(abs(outer(newdf,newdf,FUN="")))/
       length(newdf)/(length(newdf)-1)*sqrt(pi)/2
       c(mean=mymean,median=mymedian,se=semean, 
       lcl=lcl,ucl=ucl,sd=mysd,gsd=mygini)
    #gini coef is a robust measure of SE
    }
    strap.df <- boot(df$var,statistic=my.boot.fnx, R=1000)
    

    I also found this source for time data http://eranraviv.com/bootstrapping-time-series-r-code/

    Also good resource for proper analytical methods for different designs:

    http://people.stat.sfu.ca/~cschwarz/CourseNotes/

    Anyways, sorry I wasn't too much of help but wanted to share some thoughts.