Search code examples
rpermutationsimulationshufflemontecarlo

R: Monte Carlo procedure by Permute or Sample function to generate null distribution


sample data

From this data set, I have all patient samples (69 rows total) assigned by my cluster analysis and clusters were labelled as column 3 "Cluster.assigned", 8 clusters in total, UNEQUAL size per cluster. Other columns contains variables, of that I want to test the numerical variables (such as Age) to see if anything is enriched compared to random by chance.

Now I'm hitting roadblocks due to my coding ability. But my idea is to see the real data as Observed, then shuffle the labels of clusters by using sample or permute function, like a Monte Carlo simulation, say 1000 times and call that simulated distribution as Expected.

Using Age column as an example:

#minimum dummy 30-row data
Patient.ID <-c("S3077497","S1041120","S162465","S563275","S2911623","S3117192","S2859024","S2088278","S3306185","S190789","S12146451","S2170842","S115594","S2024203","S1063872","S2914138","S303984","S570813","S2176683","S820460","S1235729","S3009401","S2590229","S629309","S120256","S2572773","S3180483","S3032079","S3217608","S5566943")

Cluster.assigned <- c("cluster1","cluster1","cluster1","cluster1","cluster1","cluster1","cluster1","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster3","cluster3","cluster3","cluster3","cluster3","cluster3","cluster3","cluster4","cluster4","cluster4")

Age <- c(61,80,78,69,57,70,60,59,72,82,66,68,70,62,82,80,67,77,74,77,74,74,64,70,74,64,54,73,58,87)

CLL_3S <-cbind(Patient.ID, Cluster.assigned, Age)

To see if there's any cluster that has patients enriched in certain age, the null hypothesis is there's no difference in age distribution across clusters. Now I should shuffle the patient labels or shuffle the Age data, say 1000 times, then I should have a simulated dataframe, from which I should be able to calculate the mean and standard deviation of the simulated (Expected)

#I image to use shuffle to permute 1000 times
#And combine the simulated into a massive dataframe
 shuffled <- numeric(length=1000)
 N <-nrows(CLL_3S)

 set.seed(123)
  for (i in seq_len(length(shuffled) -1)) {
      perm <- shuffle(N)
      .........

Next step is I will then use the actual observation of patient age in each cluster to calculate enrichment by using a Z score. Say obs (value - Expected Mean)/SD.

Once this process is automated, then I can apply this to other columns of interest and other datasets with different numbers of clusters. I have read something about the sample() and shuffle() but it doesnt really help me solving this particular problem...


Solution

  • I am not sure whether the code below meets your goal. If I understand your question correctly, what I should do is to shuffle cluster assignments only and then add a new column of z-score grouped by cluster labels.

    • sample makes the random shuffle
    • scale is used to calculate the z-score
    • ave helps to calculate the z-score by cluster labels
    • replicate is to run the simulation for multiple times
    replicate(1000,
      within(
        transform(CLL_3S,
          Cluster.assigned = Cluster.assigned[sample(1:nrow(CLL_3S))]
        ),
        zscore <- ave(Age, Cluster.assigned, FUN = scale)
      ),
      simplify = FALSE
    )
    

    Update

    If you just want to average the mean and sd over 1000 simulations, you can try the code below

    n <- 1000
    res <- Reduce(
      `+`,
      replicate(n,
        with(
          CLL_3S,
          do.call(rbind, tapply(Age, Cluster.assigned[sample(1:nrow(CLL_3S))], FUN = function(x) c(Mean = mean(x), Var = var(x))))
        ),
        simplify = FALSE
      )
    ) / n
    res <- within(as.data.frame(res), SD <- sqrt(Var))
    

    which gives

    > res
                 Mean      Var       SD
    cluster1 70.21086 68.99152 8.306114
    cluster2 70.06915 71.93188 8.481267
    cluster3 70.03571 70.19276 8.378112
    cluster4 70.12500 68.98867 8.305942
    

    Data

    > dput(CLL_3S)
    structure(list(Patient.ID = c("S3077497", "S1041120", "S162465", 
    "S563275", "S2911623", "S3117192", "S2859024", "S2088278", "S3306185",
    "S190789", "S12146451", "S2170842", "S115594", "S2024203", "S1063872",
    "S2914138", "S303984", "S570813", "S2176683", "S820460", "S1235729",
    "S3009401", "S2590229", "S629309", "S120256", "S2572773", "S3180483",
    "S3032079", "S3217608", "S5566943"), Cluster.assigned = c("cluster1",
    "cluster1", "cluster1", "cluster1", "cluster1", "cluster1", "cluster1", 
    "cluster2", "cluster2", "cluster2", "cluster2", "cluster2", "cluster2",
    "cluster2", "cluster2", "cluster2", "cluster2", "cluster2", "cluster2",
    "cluster2", "cluster3", "cluster3", "cluster3", "cluster3", "cluster3",
    "cluster3", "cluster3", "cluster4", "cluster4", "cluster4"), 
        Age = c(61, 80, 78, 69, 57, 70, 60, 59, 72, 82, 66, 68, 70,
        62, 82, 80, 67, 77, 74, 77, 74, 74, 64, 70, 74, 64, 54, 73,
        58, 87)), class = "data.frame", row.names = c(NA, -30L))