r for-loop coding-efficiency data-generation

How to increase performance when randomly selecting clusters and adding observations?

In a clustered dataset, I want to randomly pick some clusters and then add some simulated observations to the selected clusters. Then I want to create a dataset that combines the simulated and original observations from the selected clusters with all the original observations from the unselected clusters. I would also like to repeat this process many times and thus create many (maybe 1000) new datasets. I managed to do this using for loop but would like to know if there is any more efficient and concise way to accomplish this. Here is an example dataset:

## simulate some data
y <- rnorm(20)
x <- rnorm(20)
z <- rep(1:5, 4)
w <- rep(1:4, each=5)
dd <- data.frame(id=z, cluster=w, x=x, y=y)
#    id cluster           x           y
# 1   1       1  0.30003855  0.65325768
# 2   2       1 -1.00563626 -0.12270866
# 3   3       1  0.01925927 -0.41367651
# 4   4       1 -1.07742065 -2.64314895
# 5   5       1  0.71270333 -0.09294102
# 6   1       2  1.08477509  0.43028470
# 7   2       2 -2.22498770  0.53539884
# 8   3       2  1.23569346 -0.55527835
# 9   4       2 -1.24104450  1.77950291
# 10  5       2  0.45476927  0.28642442
# 11  1       3  0.65990264  0.12631586
# 12  2       3 -0.19988983  1.27226678
# 13  3       3 -0.64511396 -0.71846622
# 14  4       3  0.16532102 -0.45033862
# 15  5       3  0.43881870  2.39745248
# 16  1       4  0.88330282  0.01112919
# 17  2       4 -2.05233698  1.63356842
# 18  3       4 -1.63637927 -1.43850664
# 19  4       4  1.43040234 -0.19051680
# 20  5       4  1.04662885  0.37842390

cl <- split(dd, dd$cluster)  ## split the data based on clusters 
k <- length(dd$id)
l <- length(cl)

`%notin%` <- Negate(`%in%`)  ## define "not in" to exclude unselected clusters so
                             ## as to retain their original observations

A clsamp function in the following code is then created which includes two for loops. The first for loop is to exclude the unselected clusters and the second for loop is to simulate new observations and append them to the selected clusters. Note that I randomly sample 2 clusters (10% of the total number of observations), without replacement

clsamp <- function(cl, k) {
  a <- sample(cl, size=0.1*k, replace=FALSE)  
  jud <- (names(cl) %notin% names(a))  
  need <- names(cl)[jud] 
  T3 <- NULL
  for (k in need) {
    T3 <- rbind(T3, cl[[k]])  
  }
  subt <- NULL
  s <- a
  for (j in 1:2) {
    y <- rnorm(2)
    x <- rnorm(2)
    d <- cbind(id=nrow(a[[j]]) + c(1:length(x)), 
               cluster=unique(a[[j]]$cluster), x, y)
    s[[j]] <- rbind(a[[j]], d)
    subt <- rbind(subt, s[[j]])
  }
  T <- rbind(T3, subt)
  return(T)
}

Finally, this creates a list of 5 datasets each of which combines the simulated and original observations from the selected clusters with all the original observations from the unselected clusters

Q <- vector(mode="list", length=5)
for (i in 1:length(Q)) {
  Q[[i]] <- clsamp(cl, 20)
}

Anyone knows a shorter way to do this? Maybe use the replicate function? Thanks.

Solution

This generates a sizeX2 matrix of random values and cbinds sampled cluster names and consecutive ids to it. It directly starts with dd and also works when you convert dd to a matrix mm, which might be slightly faster. Output is a data frame, though. Instead of your k I use f to directly calculate the number of rows that should be added to the two selected clusters. In case the size gets zero, the original data frame is returned.

clsamp2 <- function(m, f=.1) {
  size <- round(nrow(m)*f)
  if (size == 0) as.data.frame(m)
  else {
    ids <- unique(m[,1])
    cls <- unique(m[,2])
    rd <- matrix(rnorm(size * 4), ncol=2, dimnames=list(NULL, c("x", "y")))
    out <- rbind.data.frame(m, cbind(id=rep(max(ids) + 1:size, each=2), 
                                     cluster=sample(cls, 2), rd))
    `rownames<-`(out[order(out$cluster, out$id), ], NULL)
  }
}

Result

set.seed(42)  ## same seed also used for creating `dd`
clsamp2(dd, .1)

## or
mm <- as.matrix(dd)
clsamp2(mm, .1)

#    id cluster           x           y
# 1   1       1 -0.30663859  1.37095845
# 2   2       1 -1.78130843 -0.56469817
# 3   3       1 -0.17191736  0.36312841
# 4   4       1  1.21467470  0.63286260
# 5   5       1  1.89519346  0.40426832
# 6   1       2 -0.43046913 -0.10612452
# 7   2       2 -0.25726938  1.51152200
# 8   3       2 -1.76316309 -0.09465904
# 9   4       2  0.46009735  2.01842371
# 10  5       2 -0.63999488 -0.06271410
# 11  6       2  1.37095845  0.40426832
# 12  7       2  0.36312841  1.51152200
# 13  1       3  0.45545012  1.30486965
# 14  2       3  0.70483734  2.28664539
# 15  3       3  1.03510352 -1.38886070
# 16  4       3 -0.60892638 -0.27878877
# 17  5       3  0.50495512 -0.13332134
# 18  1       4 -1.71700868  0.63595040
# 19  2       4 -0.78445901 -0.28425292
# 20  3       4 -0.85090759 -2.65645542
# 21  4       4 -2.41420765 -2.44046693
# 22  5       4  0.03612261  1.32011335
# 23  6       4 -0.56469817 -0.10612452
# 24  7       4  0.63286260 -0.09465904

To create the list of five samples, you may use replicate.

replicate(5, clsamp2(dd, .1), simplify=FALSE)

Running time is negligible.

system.time(replicate(1000, clsamp2(dd, .1), simplify=FALSE))
# user  system elapsed 
# 0.44    0.03    0.44