Search code examples
rkolmogorov-smirnov

How to perform KS.test in a dataframe in R by groups


I want to perform two-sample Kolmogorov-Smirnov (KS) tests in R for an example dataframe "data" below:

    Protein1    Protein2    Protein3    Protein4    Protein5    Protein6    Protein7    Protein8    Group
Sample1 0.56    1.46    0.64    2.53    1.96    305.29  428.41  113.22  Control
Sample2 0.75    2.29    0.38    4.31    1.72    307.95  492.58  82.75   Control
Sample3 2.05    1.73    2.42    14.75   2.92    523.92  426.94  131.51  Control
Sample4 1.71    5.37    0.68    6.39    2.02    343.05  435.16  123.12  Control
Sample5 13.31   0.94    1.21    3.83    2.83    313.71  327.84  66.8    Control
Sample6 0.36    1.81    0.42    2.25    1.48    335.23  352.55  93.81   Control
Sample7 0.28    3.26    0.49    2.62    1.96    251.49  468.19  80.27   Control
Sample8 1.62    17.01   0.49    2.87    1.7 254.79  402.9   86.8    Control
Sample9 0.59    2.64    0.18    2.93    1.23    388.87  384.53  109.52  Control
Sample10    1.67    3.75    0.43    3.89    1.83    306.28  440.86  97.55   Control
Sample11    15.53   12.02   0.57    1.81    2.31    328.56  352.98  118.18  Control
Sample12    0.94    7.06    35.77   4.98    2.44    389.14  376.18  119.75  Control
Sample13    2.07    1.73    0.38    3.89    2.56    233.81  377.21  72.1    Control
Sample14    3.2 1.38    0.5 4.05    2.51    406.57  538.57  209.16  Patient
Sample15    1.33    2.17    0.46    3.31    1.72    278.04  276.37  79.4    Patient
Sample16    1.48    2.9 0.84    9.27    1.94    332.76  413.66  99.09   Patient
Sample17    2.02    1.96    0.25    4.16    1.96    358.73  383.63  107.46  Patient
Sample18    2.94    2   0.27    3.99    2.55    354.78  493.02  145.36  Patient
Sample19    1.01    8.1 0.35    3.65    1.62    335.18  264.74  145.15  Patient
Sample20    6.95    15.48   2.94    3.64    2.3 307.23  484.38  119.61  Patient
Sample21    0.52    1.38    0.56    3.08    1.86    244.3   304.74  76.87   Patient
Sample22    0.35    2.17    0.38    4.51    2.09    304.68  369.98  151.76  Patient
Sample23    2.26    2.9 0.3 4.44    2.43    302.51  367.51  150.69  Patient
Sample24    3.19    1.96    0.81    2.94    2.15    309.59  362.18  133.49  Patient
Sample25    1.12    2   0.71    3.77    2.42    334.36  358.9   131.35  Patient
Sample26    5.28    8.1 0.81    22.84   2.35    422.18  369.71  76.35   Patient

To perform a KS test between 2 individual columns, this is the code:

> ks.test(data$Protein1, data$Protein2, data=data)

    Two-sample Kolmogorov-Smirnov test

data:  data$Protein1 and data$Protein2
D = 0.42308, p-value = 0.01905
alternative hypothesis: two-sided

Warning message:
In ks.test(data$Protein1, data$Protein2, data = data) :
  cannot compute exact p-value with ties

However, I want to do this per column and by group. For example it is easy to do this for a t.test or wilcox.test as you can code either as t.test(y1,y2) or t.test(y~x) # where y is numeric and x is a binary factor But there is no code for ks.test when it comes to binary factor. Can anyone help with this?

Eventually, I want to do this for the whole dataframe for all proteins something as I could do successfully for a t-test, but would like to do for ks.test as follows:

t_tests <- lapply(
  data[, -1], # apply the function to every variable *other than* the first one (group)
  function(x) { t.test(x ~ HealthGroups, data = data) }
)

Any help appreciated. Thanking you in advance.


Solution

  • Here's a very simple way to do it. This uses a loop, which is typically frowned upon in R circles. However, it is very simple and self-explanatory, which can be a plus for newer users, and there isn't a problem with loops being too slow in a situation like this. (Note that you can work this up to use lapply() if you prefer, but that is still a loop, it just looks different on the outside.)

    Just create two new subsetted data frames with the same variables. Then loop over the data frames calling ks.test. The output isn't very user-friendly—it will just say j— so I add a call to ?writeLines to print the name of the variable being tested.

    # I am assuming the original data frame is called d
    dc <- d[d$Group=="Control",]
    dp <- d[d$Group=="Patient",]
    for(j in 1:8){  
      writeLines(names(dc)[j])
      print(ks.test(dc[,j], dp[,j]))  
    }
    # Protein1
    # 
    #   Two-sample Kolmogorov-Smirnov test
    # 
    # data:  dc[, j] and dp[, j]
    # D = 0.30769, p-value = 0.5882
    # alternative hypothesis: two-sided
    # 
    # Protein2
    # 
    #   Two-sample Kolmogorov-Smirnov test
    # 
    # data:  dc[, j] and dp[, j]
    # D = 0.23077, p-value = 0.8793
    # alternative hypothesis: two-sided
    # 
    # Protein3
    # 
    #   Two-sample Kolmogorov-Smirnov test
    # 
    # data:  dc[, j] and dp[, j]
    # D = 0.23077, p-value = 0.8793
    # alternative hypothesis: two-sided
    # 
    # Protein4
    # 
    #   Two-sample Kolmogorov-Smirnov test
    # 
    # data:  dc[, j] and dp[, j]
    # D = 0.46154, p-value = 0.1254
    # alternative hypothesis: two-sided
    # 
    # Protein5
    # 
    #   Two-sample Kolmogorov-Smirnov test
    # 
    # data:  dc[, j] and dp[, j]
    # D = 0.23077, p-value = 0.8793
    # alternative hypothesis: two-sided
    # 
    # Protein6
    # 
    #   Two-sample Kolmogorov-Smirnov test
    # 
    # data:  dc[, j] and dp[, j]
    # D = 0.15385, p-value = 0.9992
    # alternative hypothesis: two-sided
    # 
    # Protein7
    # 
    #   Two-sample Kolmogorov-Smirnov test
    # 
    # data:  dc[, j] and dp[, j]
    # D = 0.38462, p-value = 0.2999
    # alternative hypothesis: two-sided
    # 
    # Protein8
    # 
    #   Two-sample Kolmogorov-Smirnov test
    # 
    # data:  dc[, j] and dp[, j]
    # D = 0.46154, p-value = 0.1265
    # alternative hypothesis: two-sided
    # 
    # Warning messages:
    # 1: In ks.test(dc[, j], dp[, j]) : cannot compute exact p-value with ties
    # 2: In ks.test(dc[, j], dp[, j]) : cannot compute exact p-value with ties
    # 3: In ks.test(dc[, j], dp[, j]) : cannot compute exact p-value with ties
    # 4: In ks.test(dc[, j], dp[, j]) : cannot compute exact p-value with ties
    

    This will create a data frame to house the results:

    dk <- data.frame(Protein=character(8), D=numeric(8), p=numeric(8), stringsAsFactors=F)
    for(j in 1:8){  
      k <- ks.test(dc[,j], dp[,j])
      dk$Protein[j] <- names(dc)[j]
      dk$D[j]       <- k$statistic
      dk$p[j]       <- k$p.value
    }
    dk
    #    Protein         D         p
    # 1 Protein1 0.3076923 0.5881961
    # 2 Protein2 0.2307692 0.8793244
    # 3 Protein3 0.2307692 0.8793244
    # 4 Protein4 0.4615385 0.1253895
    # 5 Protein5 0.2307692 0.8793244
    # 6 Protein6 0.1538462 0.9992124
    # 7 Protein7 0.3846154 0.2999202
    # 8 Protein8 0.4615385 0.1264877