Tags: r, statistics, sampling, resampling

Sampling from columns of ys stacked over values of x in R (visual provided)


Background

I have two variables, x and y (see the R code below the picture). When I call plot(x, y), I obtain the top-row plot (see below): the y values are stacked in a column above each x value.

Question

Why is it that, when I sample from the y values stacked above each x value (e.g., the y values stacked above x = 0), some of the sampled y values fall outside the range of their parent sample? (See the bottom row of the picture.)

[Image: top row, plot(x, y) for the full data; bottom row, the sampled y values for each x]

HERE IS MY R CODE:

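 #############  Input Values ###################
  each.sub.pop.n = 150    # number of y values stacked over each x value
  sub.pop.means = 20:10   # mean of the y values for each x value (11 means)
  predict.range = 0:10    # the x values (11 values)
  sub.pop.sd = .75        # common standard deviation of the y values
  n.sample = 2            # number of y values to draw per x value
 #############################################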
  par( mar = c(2, 4.1, 2.1, 2.1) )

  m = matrix( c(1, 2), nrow = 2, ncol = 1 ); layout(m)

  Vec.rnorm <- Vectorize(function(n, mean, sd) rnorm(n, mean, sd), 'mean')

  y <- c( Vec.rnorm(each.sub.pop.n, sub.pop.means, sub.pop.sd) )

  x <- rep(predict.range, each = each.sub.pop.n)

  plot(x, y)


  ## Unsuccessful Sampling ##
  x <- rep(predict.range, each = n.sample)

  y <- sample(y , length(x), replace = TRUE)

  plot(x, y)

Solution

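  • The unsuccessful sampling step is not conditional on x: sample(y, length(x), replace = TRUE) draws from the pooled y vector, so a value that originally sat above one x can end up paired with any other x. Below, I split the y data by x and then sample n.sample values from each group, which keeps every sampled y within the range of its own x group. Run this in place of the unsuccessful sampling step, while x and y still hold the full data.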

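    # Split y by x, then draw n.sample values (with replacement) within each group
    sampled <- lapply(split(y, x), function(z) sample(z, n.sample, replace = TRUE))
    # Rebuild a data frame; the list names are the x values
    sampled <- data.frame(y = unlist(sampled), 
                          x = as.numeric(rep(names(sampled), each = n.sample)))
    plot(sampled$x, sampled$y)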
    

    [Image: plot of the sampled y values against x]
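
To check the explanation above numerically, here is a minimal sketch that rebuilds the data from the question and compares pooled sampling with per-group sampling. The names group.range, in.range, x.new, y.pooled and y.cond are made up for this illustration, and the set.seed() call is only an assumption to make the run repeatable; none of them come from the original code.

```r
set.seed(1)  # assumed seed, only so the run is repeatable

each.sub.pop.n <- 150
sub.pop.means  <- 20:10
predict.range  <- 0:10
sub.pop.sd     <- 0.75
n.sample       <- 2

# Rebuild the stacked data: one group of y values per x value
y <- unlist(lapply(sub.pop.means,
                   function(m) rnorm(each.sub.pop.n, m, sub.pop.sd)))
x <- rep(predict.range, each = each.sub.pop.n)

# Observed (min, max) of y within each x group, as a named list
group.range <- lapply(split(y, x), range)

# Hypothetical helper: TRUE where a sampled y lies inside the range of its x group
in.range <- function(xs, ys) {
  mapply(function(xi, yi) {
    r <- group.range[[as.character(xi)]]
    yi >= r[1] && yi <= r[2]
  }, xs, ys)
}

x.new <- rep(predict.range, each = n.sample)

# Pooled sampling (the approach from the question): y is drawn ignoring x
y.pooled <- sample(y, length(x.new), replace = TRUE)

# Conditional sampling: draw n.sample values inside each x group
y.cond <- unlist(lapply(split(y, x),
                        function(z) sample(z, n.sample, replace = TRUE)))

mean(in.range(x.new, y.pooled))  # share of pooled draws inside their group's range
mean(in.range(x.new, y.cond))    # conditional draws are inside by construction
```

Every value in y.cond is drawn from its own group, so it cannot leave that group's range. The pooled draws are only in range when the randomly chosen y happens to fit the x it is paired with.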