Tags: r, for-loop, matrix, lapply, kolmogorov-smirnov

R: loop performing ks tests across data frame stored in matrix


I apologize for any strange syntax; I am just learning to program. I have a df with 100 columns and 5304 rows. I need to perform separate two-sided ks.tests on 94 of the last numeric columns (6:ncol(df)), comparing each one against the 5th numeric column, which is my reference column:

r <- df$rank

I'd also like to store the p-values in a matrix. From what I understand, I can use either a for loop or one of the apply functions. I have some simple code that only outputs a single test summary (it seems to be overwriting the results):

for (i in 6:ncol(df))
y<-df[,i]
ks.test(r,y)->K
> K

Two-sample Kolmogorov-Smirnov test

data:  r and y
D = 0.71983, p-value < 2.2e-16
alternative hypothesis: two-sided

I've tried many variations of this, and I've also tried lapply without getting it to work. Any insight into why K does not hold multiple results, or into how to assign the output to a matrix? Thank you.

edit: sample data set

probe set    symbol  zscore   rank  X1   X4   X13  X15   ... N (N = 100)
22133-x_at   SP110   4.73635  1     400  14   5    1000
.                             2     5    430  56   150
.                             3     24   78   23   9000
... N (N = 5304)

Solution

  • Consider sapply to return a matrix of ks.test statistic and p.value:

    # RANDOM DATA TO DEMONSTRATE
    set.seed(147)
    df <- data.frame(id1 = sample(LETTERS, 5304, replace=TRUE),
                     id2 = sample(LETTERS, 5304, replace=TRUE),
                     id3 = sample(LETTERS, 5304, replace=TRUE),
                     id4 = sample(LETTERS, 5304, replace=TRUE),
                     setNames(lapply(5:100, function(i) rnorm(5304)),
                              paste0("Col", 5:100)))
    
    r <- df[,5]
    res <- sapply(df[,6:100], function(y) {
      ks <- ks.test(r, y)
      # setNames() gives clean row names; c(statistic = ks$statistic, ...) would yield "statistic.D"
      setNames(c(ks$statistic, ks$p.value), c("statistic", "p.value"))
    })
    
    # PRINT FIRST FIVE COLS
    res[,1:5]
    #                 Col6       Col7       Col8      Col9      Col10
    # statistic 0.02111614 0.01338612 0.01074661 0.0224359 0.01677979
    # p.value   0.18774138 0.72887906 0.91933648 0.1384762 0.44412866
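
  • If you would rather keep a for loop, the likely reason the original code gives a single result is that a for loop without braces only repeats the first statement after it, so ks.test() runs once after the loop has finished, on the last column. Even with braces, K would be overwritten on every pass unless each result is stored somewhere. A minimal sketch of one way to do that is below; the small random data frame and the name res_loop are assumptions for illustration, not taken from the question's data:

```r
# Small random data frame standing in for the real data (illustrative only)
set.seed(1)
df <- data.frame(id1 = sample(LETTERS, 200, replace = TRUE),
                 id2 = sample(LETTERS, 200, replace = TRUE),
                 id3 = sample(LETTERS, 200, replace = TRUE),
                 id4 = sample(LETTERS, 200, replace = TRUE),
                 setNames(lapply(5:10, function(i) rnorm(200)),
                          paste0("Col", 5:10)))

r    <- df[, 5]          # reference column
cols <- 6:ncol(df)       # columns to compare against r

# Preallocate one row per tested column (res_loop is a hypothetical name)
res_loop <- matrix(NA_real_, nrow = length(cols), ncol = 2,
                   dimnames = list(names(df)[cols], c("statistic", "p.value")))

for (i in cols) {        # braces keep every statement inside the loop body
  ks <- ks.test(r, df[, i])
  # Store this column's result in its own row instead of overwriting one object
  res_loop[names(df)[i], ] <- c(ks$statistic, ks$p.value)
}

res_loop
```

    This layout has one row per tested column, which is the transpose of what the sapply call returns; t(res_loop) should give the same shape as res.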