I have a large sparse matrix (lets call it matrix
) in which the rows are the features and the columns are the samples. Each column/sample belong to 1 of 6 groups. I randomly sample some amount from each group and store what index they belong to in the original matrix.
astro_index <- Map(sample,row_index, num_sample)[1]
endo_index <- Map(sample,row_index, num_sample)[2]
micro_index <- Map(sample,row_index, num_sample)[3]
neuron_index <- Map(sample,row_index, num_sample)[4]
oligo_index <- Map(sample,row_index, num_sample)[5]
opc_index <- Map(sample,row_index, num_sample)[6]
The goal is to be able perform a Wilcox test and get the p-value on all the combination of the 6 groups for each features. The big issue is that I have over 30,000 features to test on all combination of 6 groups (so that is 15 comparisons for each of the 30,000+ features).
So I have two current methods. The first one uses the apply function and does it for only one comparison (here astro and neuron group). The disadvantage for this method is that I run into memory issues and it only does 1 comparison at a time. I would have to write this 14 more times to get all possible comparisons.
store_p <- apply(matrix,1,function(x) {wilcox.test(x[astro_index],x[neuron_index])$p.value })
The second method uses a for loop to go through all the features but I take advantage of the combn and the data frame to calculate the p-value for all combinations but one feature at a time. This method is really slow but does not crash.
for (i in features){
df <- data.frame('Astro' = matrix[i,astro_index], 'Endo' = matrix[i,endo_index], 'Micro' = matrix[i,micro_index], 'Neuron' = matrix[i,neuron_index], 'Oligo' = matrix[i,oligo_index], 'OPC' = matrix[i,opc_index])
result <- combn(names(df), 2, FUN = function(x) paste(paste(x, collapse='-'), wilcox.test(df[,x[1]], df[,x[2]], paired = TRUE)$p.value, sep=" : "))
hold_list <- append(hold_list, list(result))
}
To give a sense of what the result
looks like. Here is a sample output of result
> result
[1] "Astro-Endo : 0.115331575924872" "Astro-Micro : 0.935664046257304" "Astro-Neuron : 0.0271849565394441"
[4] "Astro-Oligo : 0.00147694402781699" "Astro-OPC : 0.0476580762532988" "Endo-Micro : 0.297672151508384"
[7] "Endo-Neuron : 2.38134038927696e-06" "Endo-Oligo : 0.0323129112432441" "Endo-OPC : 0.451258974150342"
[10] "Micro-Neuron : 0.000143621746738224" "Micro-Oligo : 0.0178171887595787" "Micro-OPC : 0.0692129715131915"
[13] "Neuron-Oligo : 6.68255453156116e-10" "Neuron-OPC : 6.201108273594e-07" "Oligo-OPC : 0.142213241936393"
I would ideally like the best of both world of both methods and do a more efficient process to compute these tests. Also if there is a suggestion to designing a different data frame all together to tackle this task in one way I would appreciate that too.
EDIT
I realized I did not make at as clear but the result
is only for one feature of all combinations. I have a for loop so that it goes through all the features. In essence, there should be a p-value calculated for all the feature and for all the combination.
I would use pairwiseWilcox
from scran
for that - that seems ideally suited for your problem.
It performs pairwise Wilcoxon rank sum tests for each row between groups of columns, where groups is a vector of column assignments.
Edit:
Example:
library(Matrix)
types <- c("Astro", "Neuron", "Endo", "Oligo", "OPC", "Micro")
# generate sparse matrix
set.seed(123)
mat <- Matrix(0, nrow = 10000, ncol = 1000, sparse = TRUE)
mat[sample(seq_along(mat), 1E5)] <- runif(n = 1e5, min = 0, max=100)
groups <- c(rep(types, each = floor(ncol(mat)/6)), rep("Micro", ncol(mat) %% 6))
colnames(mat) <- make.unique(groups)
# sample n=100 samples of each group
idx <- setNames(lapply(types, function(x) grep(x, colnames(mat))), types)
smp <- Map(sample, idx, size = 100)
groups <- gsub("[0-9]+", "", names(unlist(smp)))
# subset mat to sampled columns
mat <- mat[, unlist(smp, use.names = FALSE)]
library(scran)
pwt <- pairwiseWilcox(mat, groups = groups)
pwt
#> $statistics
#> $statistics[[1]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.49995 1.000000 1.000000
#> 2 0.51000 0.158341 0.616668
#> 3 0.49000 0.158341 0.616668
#> 4 0.50490 0.573540 0.856541
#> 5 0.48985 0.308851 0.616668
#> ... ... ... ...
#> 9996 0.4950 0.565662 0.856541
#> 9997 0.5050 0.322174 0.616668
#> 9998 0.4951 0.573540 0.856541
#> 9999 0.4950 0.322174 0.616668
#> 10000 0.5050 0.322174 0.616668
#>
#> $statistics[[2]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.5050 0.3221741 0.613045
#> 2 0.5049 0.5735395 0.858464
#> 3 0.4800 0.0444225 0.613045
#> 4 0.4947 0.6352736 0.948311
#> 5 0.4949 0.5578376 0.858464
#> ... ... ... ...
#> 9996 0.49500 0.565662 0.858464
#> 9997 0.50005 1.000000 1.000000
#> 9998 0.50500 0.322174 0.613045
#> 9999 0.50000 1.000000 1.000000
#> 10000 0.50500 0.322174 0.613045
#>
#> $statistics[[3]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.5050 0.322174 0.605697
#> 2 0.5001 0.995980 1.000000
#> 3 0.5000 1.000000 1.000000
#> 4 0.5100 0.158341 0.605697
#> 5 0.4949 0.557838 0.854499
#> ... ... ... ...
#> 9996 0.50005 1.000000 1.000000
#> 9997 0.49995 1.000000 1.000000
#> 9998 0.50005 1.000000 1.000000
#> 9999 0.49500 0.322174 0.605697
#> 10000 0.49995 1.000000 1.000000
#>
#> $statistics[[4]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.49995 1.000000 1.000000
#> 2 0.50010 0.995980 1.000000
#> 3 0.50000 1.000000 1.000000
#> 4 0.49490 0.648212 0.959177
#> 5 0.50500 0.322174 0.615026
#> ... ... ... ...
#> 9996 0.4949 0.557838 0.859750
#> 9997 0.4951 0.573540 0.859750
#> 9998 0.4852 0.182661 0.615026
#> 9999 0.5000 1.000000 1.000000
#> 10000 0.4949 0.557838 0.859750
#>
#> $statistics[[5]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.50500 0.322174 0.620334
#> 2 0.49480 0.641729 0.964426
#> 3 0.50000 1.000000 1.000000
#> 4 0.51000 0.158341 0.620334
#> 5 0.49995 1.000000 1.000000
#> ... ... ... ...
#> 9996 0.50005 1.000000 1.000000
#> 9997 0.49015 0.323442 0.620334
#> 9998 0.50005 1.000000 1.000000
#> 9999 0.49500 0.322174 0.620334
#> 10000 0.50500 0.322174 0.620334
#>
#> $statistics[[6]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.50005 1.000000 1.000000
#> 2 0.49000 0.158341 0.616668
#> 3 0.51000 0.158341 0.616668
#> 4 0.49510 0.573540 0.856541
#> 5 0.51015 0.308851 0.616668
#> ... ... ... ...
#> 9996 0.5050 0.565662 0.856541
#> 9997 0.4950 0.322174 0.616668
#> 9998 0.5049 0.573540 0.856541
#> 9999 0.5050 0.322174 0.616668
#> 10000 0.4950 0.322174 0.616668
#>
#> $statistics[[7]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.50500 0.322174 0.616668
#> 2 0.49500 0.322174 0.616668
#> 3 0.48960 0.392127 0.746909
#> 4 0.49005 0.318530 0.616668
#> 5 0.50500 0.654721 0.960283
#> ... ... ... ...
#> 9996 0.5001 0.995980 1.000000
#> 9997 0.4950 0.322174 0.616668
#> 9998 0.5100 0.158341 0.616668
#> 9999 0.5050 0.322174 0.616668
#> 10000 0.5000 1.000000 1.000000
#>
#> $statistics[[8]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.505 0.322174 0.604226
#> 2 0.490 0.158341 0.604226
#> 3 0.510 0.158341 0.604226
#> 4 0.505 0.322174 0.604226
#> 5 0.505 0.654721 0.952044
#> ... ... ... ...
#> 9996 0.50510 0.557838 0.849437
#> 9997 0.49500 0.322174 0.604226
#> 9998 0.50500 0.565662 0.849437
#> 9999 0.49995 1.000000 1.000000
#> 10000 0.49500 0.322174 0.604226
#>
#> $statistics[[9]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.49995 1.000000 1.000000
#> 2 0.49000 0.158341 0.611076
#> 3 0.51000 0.158341 0.611076
#> 4 0.49005 0.318530 0.611076
#> 5 0.51500 0.082748 0.611076
#> ... ... ... ...
#> 9996 0.5000 1.000000 1.000000
#> 9997 0.4900 0.158341 0.611076
#> 9998 0.4899 0.405995 0.762863
#> 9999 0.5050 0.322174 0.611076
#> 10000 0.4900 0.158341 0.611076
#>
#> $statistics[[10]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.50500 0.322174 0.619147
#> 2 0.48500 0.082748 0.619147
#> 3 0.51000 0.158341 0.619147
#> 4 0.50500 0.322174 0.619147
#> 5 0.50985 0.323442 0.619147
#> ... ... ... ...
#> 9996 0.50500 0.565662 0.863244
#> 9997 0.48500 0.082748 0.619147
#> 9998 0.50510 0.557838 0.863244
#> 9999 0.50005 1.000000 1.000000
#> 10000 0.50000 1.000000 1.000000
#>
#> $statistics[[11]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.4950 0.3221741 0.613045
#> 2 0.4951 0.5735395 0.858464
#> 3 0.5200 0.0444225 0.613045
#> 4 0.5053 0.6352736 0.948311
#> 5 0.5051 0.5578376 0.858464
#> ... ... ... ...
#> 9996 0.50500 0.565662 0.858464
#> 9997 0.49995 1.000000 1.000000
#> 9998 0.49500 0.322174 0.613045
#> 9999 0.50000 1.000000 1.000000
#> 10000 0.49500 0.322174 0.613045
#>
#> $statistics[[12]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.49500 0.322174 0.616668
#> 2 0.50500 0.322174 0.616668
#> 3 0.51040 0.392127 0.746909
#> 4 0.50995 0.318530 0.616668
#> 5 0.49500 0.654721 0.960283
#> ... ... ... ...
#> 9996 0.4999 0.995980 1.000000
#> 9997 0.5050 0.322174 0.616668
#> 9998 0.4900 0.158341 0.616668
#> 9999 0.4950 0.322174 0.616668
#> 10000 0.5000 1.000000 1.000000
#>
#> $statistics[[13]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.5000 1.0000000 1.000000
#> 2 0.4951 0.5735395 0.868341
#> 3 0.5200 0.0444225 0.625009
#> 4 0.5150 0.0827480 0.625009
#> 5 0.5000 1.0000000 1.000000
#> ... ... ... ...
#> 9996 0.50500 0.565662 0.868341
#> 9997 0.49995 1.000000 1.000000
#> 9998 0.49500 0.322174 0.625009
#> 9999 0.49500 0.322174 0.625009
#> 10000 0.49500 0.322174 0.625009
#>
#> $statistics[[14]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.49500 0.3221741 0.606038
#> 2 0.49510 0.5735395 0.852213
#> 3 0.52000 0.0444225 0.606038
#> 4 0.50005 1.0000000 1.000000
#> 5 0.51000 0.1583409 0.606038
#> ... ... ... ...
#> 9996 0.4998 0.9879417 1.000000
#> 9997 0.4951 0.5735395 0.852213
#> 9998 0.4800 0.0444225 0.606038
#> 9999 0.5000 1.0000000 1.000000
#> 10000 0.4900 0.1583409 0.606038
#>
#> $statistics[[15]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.50000 1.0000000 1.000000
#> 2 0.49005 0.3185296 0.619978
#> 3 0.52000 0.0444225 0.619978
#> 4 0.51500 0.0827480 0.619978
#> 5 0.50500 0.5656624 0.863114
#> ... ... ... ...
#> 9996 0.50500 0.565662 0.863114
#> 9997 0.49015 0.323442 0.619978
#> 9998 0.49500 0.322174 0.619978
#> 9999 0.49500 0.322174 0.619978
#> 10000 0.50000 1.000000 1.000000
#>
#> $statistics[[16]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.4950 0.322174 0.605697
#> 2 0.4999 0.995980 1.000000
#> 3 0.5000 1.000000 1.000000
#> 4 0.4900 0.158341 0.605697
#> 5 0.5051 0.557838 0.854499
#> ... ... ... ...
#> 9996 0.49995 1.000000 1.000000
#> 9997 0.50005 1.000000 1.000000
#> 9998 0.49995 1.000000 1.000000
#> 9999 0.50500 0.322174 0.605697
#> 10000 0.50005 1.000000 1.000000
#>
#> $statistics[[17]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.495 0.322174 0.604226
#> 2 0.510 0.158341 0.604226
#> 3 0.490 0.158341 0.604226
#> 4 0.495 0.322174 0.604226
#> 5 0.495 0.654721 0.952044
#> ... ... ... ...
#> 9996 0.49490 0.557838 0.849437
#> 9997 0.50500 0.322174 0.604226
#> 9998 0.49500 0.565662 0.849437
#> 9999 0.50005 1.000000 1.000000
#> 10000 0.50500 0.322174 0.604226
#>
#> $statistics[[18]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.5000 1.0000000 1.000000
#> 2 0.5049 0.5735395 0.868341
#> 3 0.4800 0.0444225 0.625009
#> 4 0.4850 0.0827480 0.625009
#> 5 0.5000 1.0000000 1.000000
#> ... ... ... ...
#> 9996 0.49500 0.565662 0.868341
#> 9997 0.50005 1.000000 1.000000
#> 9998 0.50500 0.322174 0.625009
#> 9999 0.50500 0.322174 0.625009
#> 10000 0.50500 0.322174 0.625009
#>
#> $statistics[[19]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.4950 0.322174 0.619978
#> 2 0.4999 0.995980 1.000000
#> 3 0.5000 1.000000 1.000000
#> 4 0.4850 0.082748 0.619978
#> 5 0.5100 0.158341 0.619978
#> ... ... ... ...
#> 9996 0.4949 0.557838 0.863504
#> 9997 0.4951 0.573540 0.863504
#> 9998 0.4850 0.176800 0.619978
#> 9999 0.5050 0.322174 0.619978
#> 10000 0.4949 0.557838 0.863504
#>
#> $statistics[[20]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.5000 1.000000 1.000000
#> 2 0.4947 0.635274 0.958759
#> 3 0.5000 1.000000 1.000000
#> 4 0.5000 1.000000 1.000000
#> 5 0.5050 0.565662 0.869131
#> ... ... ... ...
#> 9996 0.49995 1.000000 1.000000
#> 9997 0.49015 0.323442 0.625372
#> 9998 0.50005 1.000000 1.000000
#> 9999 0.50005 1.000000 1.000000
#> 10000 0.50500 0.322174 0.625372
#>
#> $statistics[[21]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.50005 1.000000 1.000000
#> 2 0.49990 0.995980 1.000000
#> 3 0.50000 1.000000 1.000000
#> 4 0.50510 0.648212 0.959177
#> 5 0.49500 0.322174 0.615026
#> ... ... ... ...
#> 9996 0.5051 0.557838 0.859750
#> 9997 0.5049 0.573540 0.859750
#> 9998 0.5148 0.182661 0.615026
#> 9999 0.5000 1.000000 1.000000
#> 10000 0.5051 0.557838 0.859750
#>
#> $statistics[[22]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.50005 1.000000 1.000000
#> 2 0.51000 0.158341 0.611076
#> 3 0.49000 0.158341 0.611076
#> 4 0.50995 0.318530 0.611076
#> 5 0.48500 0.082748 0.611076
#> ... ... ... ...
#> 9996 0.5000 1.000000 1.000000
#> 9997 0.5100 0.158341 0.611076
#> 9998 0.5101 0.405995 0.762863
#> 9999 0.4950 0.322174 0.611076
#> 10000 0.5100 0.158341 0.611076
#>
#> $statistics[[23]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.50500 0.3221741 0.606038
#> 2 0.50490 0.5735395 0.852213
#> 3 0.48000 0.0444225 0.606038
#> 4 0.49995 1.0000000 1.000000
#> 5 0.49000 0.1583409 0.606038
#> ... ... ... ...
#> 9996 0.5002 0.9879417 1.000000
#> 9997 0.5049 0.5735395 0.852213
#> 9998 0.5200 0.0444225 0.606038
#> 9999 0.5000 1.0000000 1.000000
#> 10000 0.5100 0.1583409 0.606038
#>
#> $statistics[[24]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.5050 0.322174 0.619978
#> 2 0.5001 0.995980 1.000000
#> 3 0.5000 1.000000 1.000000
#> 4 0.5150 0.082748 0.619978
#> 5 0.4900 0.158341 0.619978
#> ... ... ... ...
#> 9996 0.5051 0.557838 0.863504
#> 9997 0.5049 0.573540 0.863504
#> 9998 0.5150 0.176800 0.619978
#> 9999 0.4950 0.322174 0.619978
#> 10000 0.5051 0.557838 0.863504
#>
#> $statistics[[25]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.5050 0.322174 0.618555
#> 2 0.4947 0.635274 0.954724
#> 3 0.5000 1.000000 1.000000
#> 4 0.5150 0.082748 0.618555
#> 5 0.4950 0.322174 0.618555
#> ... ... ... ...
#> 9996 0.5051 0.557838 0.864937
#> 9997 0.4948 0.641729 0.961248
#> 9998 0.5152 0.171079 0.618555
#> 9999 0.4950 0.322174 0.618555
#> 10000 0.5100 0.158341 0.618555
#>
#> $statistics[[26]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.49500 0.322174 0.620334
#> 2 0.50520 0.641729 0.964426
#> 3 0.50000 1.000000 1.000000
#> 4 0.49000 0.158341 0.620334
#> 5 0.50005 1.000000 1.000000
#> ... ... ... ...
#> 9996 0.49995 1.000000 1.000000
#> 9997 0.50985 0.323442 0.620334
#> 9998 0.49995 1.000000 1.000000
#> 9999 0.50500 0.322174 0.620334
#> 10000 0.49500 0.322174 0.620334
#>
#> $statistics[[27]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.49500 0.322174 0.619147
#> 2 0.51500 0.082748 0.619147
#> 3 0.49000 0.158341 0.619147
#> 4 0.49500 0.322174 0.619147
#> 5 0.49015 0.323442 0.619147
#> ... ... ... ...
#> 9996 0.49500 0.565662 0.863244
#> 9997 0.51500 0.082748 0.619147
#> 9998 0.49490 0.557838 0.863244
#> 9999 0.49995 1.000000 1.000000
#> 10000 0.50000 1.000000 1.000000
#>
#> $statistics[[28]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.50000 1.0000000 1.000000
#> 2 0.50995 0.3185296 0.619978
#> 3 0.48000 0.0444225 0.619978
#> 4 0.48500 0.0827480 0.619978
#> 5 0.49500 0.5656624 0.863114
#> ... ... ... ...
#> 9996 0.49500 0.565662 0.863114
#> 9997 0.50985 0.323442 0.619978
#> 9998 0.50500 0.322174 0.619978
#> 9999 0.50500 0.322174 0.619978
#> 10000 0.50000 1.000000 1.000000
#>
#> $statistics[[29]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.5000 1.000000 1.000000
#> 2 0.5053 0.635274 0.958759
#> 3 0.5000 1.000000 1.000000
#> 4 0.5000 1.000000 1.000000
#> 5 0.4950 0.565662 0.869131
#> ... ... ... ...
#> 9996 0.50005 1.000000 1.000000
#> 9997 0.50985 0.323442 0.625372
#> 9998 0.49995 1.000000 1.000000
#> 9999 0.49995 1.000000 1.000000
#> 10000 0.49500 0.322174 0.625372
#>
#> $statistics[[30]]
#> DataFrame with 10000 rows and 3 columns
#> AUC p.value FDR
#> <numeric> <numeric> <numeric>
#> 1 0.4950 0.322174 0.618555
#> 2 0.5053 0.635274 0.954724
#> 3 0.5000 1.000000 1.000000
#> 4 0.4850 0.082748 0.618555
#> 5 0.5050 0.322174 0.618555
#> ... ... ... ...
#> 9996 0.4949 0.557838 0.864937
#> 9997 0.5052 0.641729 0.961248
#> 9998 0.4848 0.171079 0.618555
#> 9999 0.5050 0.322174 0.618555
#> 10000 0.4900 0.158341 0.618555
#>
#>
#> $pairs
#> DataFrame with 30 rows and 2 columns
#> first second
#> <character> <character>
#> 1 Astro Endo
#> 2 Astro Micro
#> 3 Astro Neuron
#> 4 Astro Oligo
#> 5 Astro OPC
#> ... ... ...
#> 26 OPC Astro
#> 27 OPC Endo
#> 28 OPC Micro
#> 29 OPC Neuron
#> 30 OPC Oligo
Created on 2020-06-18 by the reprex package (v0.3.0)
Edit #2:
Just to see what my approach would give you in about 1/20,000th of time (on my machine, at least) of the approach by StupidWolf, try this with his example mat
and group
:
set.seed(111)
celltypes = c("astro","endo","micro","neuron","oligo","opc")
mat = matrix(rnorm(10000*120),ncol=120)
colnames(mat) = paste0("cell",1:120)
rownames(mat) = paste0("gene",1:10000)
metadata = data.frame(celltype=rep(celltypes,each=20))
num_sample = 10
use_cols = tapply(1:nrow(metadata),metadata$celltype,sample,num_sample)
use_cols = unlist(use_cols)
group = metadata$celltype[use_cols]
library(scran)
library(data.table)
pwt <- pairwiseWilcox(mat[,use_cols], groups=group)
unique_comps <- !duplicated(t(apply(pwt$pairs, 1, sort)))
res <- rbindlist(setNames(lapply(pwt$statistics[unique_comps],
function(x) as.data.table(x, keep.rownames=TRUE)),
apply(pwt$pairs[unique_comps,], 1, paste, collapse = '_')),
idcol = "comparison")[, .(comparison, rn, p.value)]
setnames(res, "rn", "gene")
res[gene=="gene999"]
#> comparison gene p.value
#> 1: astro_endo gene999 0.1858767
#> 2: astro_micro gene999 0.5707504
#> 3: astro_neuron gene999 0.3846731
#> 4: astro_oligo gene999 0.4273553
#> 5: astro_opc gene999 0.3846731
#> 6: endo_micro gene999 0.4726756
#> 7: endo_neuron gene999 0.9097219
#> 8: endo_oligo gene999 0.6231762
#> 9: endo_opc gene999 0.6775850
#> 10: micro_neuron gene999 0.9097219
#> 11: micro_oligo gene999 0.9097219
#> 12: micro_opc gene999 0.9097219
#> 13: neuron_oligo gene999 0.8501067
#> 14: neuron_opc gene999 0.6775850
#> 15: oligo_opc gene999 0.9698500
Created on 2020-06-19 by the reprex package (v0.3.0)