From this data set, I have all patient samples (69 rows total) assigned by my cluster analysis and clusters were labelled as column 3 "Cluster.assigned", 8 clusters in total, UNEQUAL size per cluster. Other columns contains variables, of that I want to test the numerical variables (such as Age) to see if anything is enriched compared to random by chance.
Now I'm hitting roadblocks due to my coding ability. But my idea is to see the real data as Observed, then shuffle the labels of clusters by using sample or permute function, like a Monte Carlo simulation, say 1000 times and call that simulated distribution as Expected.
Using Age column as an example:
#minimum dummy 30-row data
Patient.ID <-c("S3077497","S1041120","S162465","S563275","S2911623","S3117192","S2859024","S2088278","S3306185","S190789","S12146451","S2170842","S115594","S2024203","S1063872","S2914138","S303984","S570813","S2176683","S820460","S1235729","S3009401","S2590229","S629309","S120256","S2572773","S3180483","S3032079","S3217608","S5566943")
Cluster.assigned <- c("cluster1","cluster1","cluster1","cluster1","cluster1","cluster1","cluster1","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster3","cluster3","cluster3","cluster3","cluster3","cluster3","cluster3","cluster4","cluster4","cluster4")
Age <- c(61,80,78,69,57,70,60,59,72,82,66,68,70,62,82,80,67,77,74,77,74,74,64,70,74,64,54,73,58,87)
CLL_3S <-cbind(Patient.ID, Cluster.assigned, Age)
To see if there's any cluster that has patients enriched in certain age, the null hypothesis is there's no difference in age distribution across clusters. Now I should shuffle the patient labels or shuffle the Age data, say 1000 times, then I should have a simulated dataframe, from which I should be able to calculate the mean and standard deviation of the simulated (Expected)
#I image to use shuffle to permute 1000 times
#And combine the simulated into a massive dataframe
shuffled <- numeric(length=1000)
N <-nrows(CLL_3S)
set.seed(123)
for (i in seq_len(length(shuffled) -1)) {
perm <- shuffle(N)
.........
Next step is I will then use the actual observation of patient age in each cluster to calculate enrichment by using a Z score. Say obs (value - Expected Mean)/SD.
Once this process is automated, then I can apply this to other columns of interest and other datasets with different numbers of clusters. I have read something about the sample() and shuffle() but it doesnt really help me solving this particular problem...
I am not sure whether the code below meets your goal. If I understand your question correctly, what I should do is to shuffle cluster assignments only and then add a new column of z-score grouped by cluster labels.
sample
makes the random shufflescale
is used to calculate the z-scoreave
helps to calculate the z-score by cluster labelsreplicate
is to run the simulation for multiple timesreplicate(1000,
within(
transform(CLL_3S,
Cluster.assigned = Cluster.assigned[sample(1:nrow(CLL_3S))]
),
zscore <- ave(Age, Cluster.assigned, FUN = scale)
),
simplify = FALSE
)
Update
If you just want to average the mean and sd over 1000 simulations, you can try the code below
n <- 1000
res <- Reduce(
`+`,
replicate(n,
with(
CLL_3S,
do.call(rbind, tapply(Age, Cluster.assigned[sample(1:nrow(CLL_3S))], FUN = function(x) c(Mean = mean(x), Var = var(x))))
),
simplify = FALSE
)
) / n
res <- within(as.data.frame(res), SD <- sqrt(Var))
which gives
> res
Mean Var SD
cluster1 70.21086 68.99152 8.306114
cluster2 70.06915 71.93188 8.481267
cluster3 70.03571 70.19276 8.378112
cluster4 70.12500 68.98867 8.305942
Data
> dput(CLL_3S)
structure(list(Patient.ID = c("S3077497", "S1041120", "S162465",
"S563275", "S2911623", "S3117192", "S2859024", "S2088278", "S3306185",
"S190789", "S12146451", "S2170842", "S115594", "S2024203", "S1063872",
"S2914138", "S303984", "S570813", "S2176683", "S820460", "S1235729",
"S3009401", "S2590229", "S629309", "S120256", "S2572773", "S3180483",
"S3032079", "S3217608", "S5566943"), Cluster.assigned = c("cluster1",
"cluster1", "cluster1", "cluster1", "cluster1", "cluster1", "cluster1",
"cluster2", "cluster2", "cluster2", "cluster2", "cluster2", "cluster2",
"cluster2", "cluster2", "cluster2", "cluster2", "cluster2", "cluster2",
"cluster2", "cluster3", "cluster3", "cluster3", "cluster3", "cluster3",
"cluster3", "cluster3", "cluster4", "cluster4", "cluster4"),
Age = c(61, 80, 78, 69, 57, 70, 60, 59, 72, 82, 66, 68, 70,
62, 82, 80, 67, 77, 74, 77, 74, 74, 64, 70, 74, 64, 54, 73,
58, 87)), class = "data.frame", row.names = c(NA, -30L))