Tags: r, machine-learning, t-test, feature-selection

t-stat for feature selection


I want to calculate a t-statistic for feature selection in R using a for loop. The data has 155 columns and the dependent variable is binary (mutagen / nonmutagen). I would like to compute a t-statistic for every column, but I couldn't figure out how to write it.

Here is the formula I'm trying to implement in R:

[image: formula for the t-statistic]

I also wrote some code, but I'm not sure it is correct, and it only handles the first column. I need to put it in a for loop that covers all columns.

abs(diff(tapply(train_df[,1], train_df$Activity, mean))) / sqrt(sd(train_df$NEG_01_NEG[train_df$Activity == "mutagen"])^2 / length(train_df$NEG_01_NEG[train_df$Activity == "mutagen"]) + 
   sd(train_df$NEG_01_NEG[train_df$Activity != "mutagen"])^2 / length(train_df$NEG_01_NEG[train_df$Activity != "mutagen"]))

Thanks in advance!


Solution

  • If you aren't worried about speed (and with 155 columns you probably don't need to be), you can use the t.test function and apply it to every column.

    Simulate some data first:

    set.seed(1)
    DF <- data.frame(y=rep(1:2, 50), x1=rnorm(100), x2=rnorm(100), x3=rnorm(100))
    head(DF)
    
      y         x1          x2         x3
    1 1 -0.6264538 -0.62036668  0.4094018
    2 2  0.1836433  0.04211587  1.6888733
    3 1 -0.8356286 -0.91092165  1.5865884
    4 2  1.5952808  0.15802877 -0.3309078
    5 1  0.3295078 -0.65458464 -2.2852355
    6 2 -0.8204684  1.76728727  2.4976616
    

    Then we can apply the t.test function to all but the first column using the formula argument.

    group <- DF$y
    lapply(DF[,-1], function(x) { t.test(x ~ group)$statistic })
    

    which returns the test statistic for each column.

    t.test computes a lot of extra information that you don't need, so you could speed this up substantially by doing the computations directly, but that really isn't necessary here.
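
For completeness, here is a minimal sketch of the direct computation written as a for loop, as the question asked. It assumes the statistic wanted is Welch's (unequal-variance) t-statistic, which is what t.test computes by default, and that there are no missing values. The helper name welch_t and the vectors features, t_stats and ref are made up for this illustration. It reuses the simulated DF from above.

```r
# Simulated data, same layout as above: y is the binary group, x1..x3 are features.
set.seed(1)
DF <- data.frame(y = rep(1:2, 50), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))

# Hypothetical helper: Welch t-statistic of x between the two levels of g.
# Assumes g has exactly two levels and x contains no NA values.
welch_t <- function(x, g) {
  g <- factor(g)
  stopifnot(nlevels(g) == 2)
  x1 <- x[g == levels(g)[1]]
  x2 <- x[g == levels(g)[2]]
  (mean(x1) - mean(x2)) / sqrt(var(x1) / length(x1) + var(x2) / length(x2))
}

# Loop over every column except the grouping variable.
features <- setdiff(names(DF), "y")
t_stats <- setNames(numeric(length(features)), features)
for (f in features) {
  t_stats[f] <- welch_t(DF[[f]], DF$y)
}

# Sanity check against t.test (Welch by default, i.e. var.equal = FALSE).
ref <- sapply(DF[features], function(x) unname(t.test(x ~ DF$y)$statistic))
stopifnot(isTRUE(all.equal(t_stats, ref)))

# Rank features by absolute t-statistic, largest first.
sort(abs(t_stats), decreasing = TRUE)
```

Note that the statistic is signed (first group mean minus second group mean), so take abs() before ranking if only the magnitude matters. With the data from the question you would presumably pass train_df in place of DF and "Activity" in place of "y".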