I am trying to get the (bootstrapped) input data for my model.
Source file: https://www.dropbox.com/s/dudzxhozr50uhr7/EddyData_2010.csv?dl=0
library("dplyr")
library("readr")
library("reshape2")
library("ggplot2")
sub <- read_csv("EddyData_2010.csv",
col_types = list(col_integer(), col_integer(), col_double(),
col_double(), col_double(), col_double(),
col_double(), col_double(), col_double(),
col_double(), col_double(), col_double()),
col_names = c("Year", "DoY", "Hour", "NEE", "LE", "H", "Rg",
"Tair", "Tsoil", "rH", "Ustar", "VPD")) %>%
filter(DoY == 170) %>%
mutate(hour = 1:48) %>%
select(NEE:hour)
# Number of resampling
n_resempling <- 1000
result_resampling <- NULL
# Do the resampling
for (i in 1:n_resempling) {
result_resampling <- sample(48, 48, replace = T) %>%
slice(sub, .) %>%
mutate(resempling_number = i) %>%
bind_rows(. , result_resampling)
}
This generates a resampling with replacement such as
The output shows 1000 bootstraps resampled with replacement along 48 half-hours that a day have.
Here is my problem:
The resampling with replacement is randomly mixing up half-hours along the day without any kind of control. For instance, I am not interested in mixing up half-hours at night and half-hours during the days. The results leads to weird calculations afterwards. Is there any possibility to code this in such a way I define what is allowed and what is not?
Example:
I have done naive bootstrap for CRD design but not for time data. Is that time series data? From what I understand, you want 2pm to be sampled only with 2 pm and not 3 pm. Understanding the sampling is important for doing the right analysis because it's really easy to go really wrong in R.
I noticed that you used a loop to bootstrap instead of a package. I used 'boot' package for naive bootstrap, so I turned to Google to look at time data. Here's what I found, I apologize this is all I've got (I couldn't comment due to lack of rep) Using boot package I would bet anything is faster than using a loop you can check with system.time( ) especially if you have a lot of data.
https://stat.ethz.ch/R-manual/R-devel/library/boot/html/tsboot.html
Here's how I worked on my naive bootstrap:
my.boot.fnx<-function(var, ind,alpha=0.95){
newdf <-na.omit(var[ind])
mymean <-mean(newdf,na.rm=TRUE)
mytrimmean <-mean(newdf, trim=0.1, na.rm=TRUE)
mymedian <-median(newdf,na.rm=TRUE)
mysd <-sd(newdf,na.rm=TRUE)
nonmiss <- length(na.omit(newdf))
semean <- mysd/sqrt(nonmiss)
lcl <- mymean - qt(1-alpha/2,nonmiss-1)*semean
ucl <- mymean + qt(1-alpha/2,nonmiss-1)*semean
mygini <-
sum(abs(outer(newdf,newdf,FUN="")))/
length(newdf)/(length(newdf)-1)*sqrt(pi)/2
c(mean=mymean,median=mymedian,se=semean,
lcl=lcl,ucl=ucl,sd=mysd,gsd=mygini)
#gini coef is a robust measure of SE
}
strap.df <- boot(df$var,statistic=my.boot.fnx, R=1000)
I also found this source for time data http://eranraviv.com/bootstrapping-time-series-r-code/
Also good resource for proper analytical methods for different designs:
http://people.stat.sfu.ca/~cschwarz/CourseNotes/
Anyways, sorry I wasn't too much of help but wanted to share some thoughts.