Multiple hypothesis testing using replicate data in R


In the following data frame, I want to calculate a p-value for each protein by comparing the 'control' replicates with the 'treated' replicates. I am very new to R and I just want to see if I can shift away from using Excel for tasks like these. In reality, I'll have thousands of proteins. I'll then use p.adjust() to correct for multiple hypothesis testing.

I'd be very grateful for any advice.

  Protein Control_1 Control_2 Control_3 Treated_1 Treated_2 Treated_3
1       1      7.15      7.16      7.11      6.91      6.88      6.92
2       2      6.64      6.61      6.59      6.37      6.35      6.41
3       3      3.68      3.78      3.81      2.40      2.09      2.17
4       4      5.04      5.01      4.69      3.43      3.52      3.66
5       5      6.92      6.81      6.90      7.12      7.21      7.27

Desired output:

  Protein Control_1 Control_2 Control_3 Treated_1 Treated_2 Treated_3       P-value
1       1      7.15      7.16      7.11      6.91      6.88      6.92      0.000413
2       2      6.64      6.61      6.59      6.37      6.35      6.41      0.000742
3       3      3.68      3.78      3.81      2.40      2.09      2.17      0.001010
4       4      5.04      5.01      4.69      3.43      3.52      3.66      0.001262
5       5      6.92      6.81      6.90      7.12      7.21      7.27      0.004306

Solution

  • Updated with @StupidWolf's comment.

    Since you are new to R, here is a solution that is easy to understand and modify.

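    # Generate random data with the same layout as yours
    df <- data.frame(Protein=1:5,Control_1=rnorm(5,5),Control_2=rnorm(5,5),
               Control_3=rnorm(5,5),Treated_1=rnorm(5,5),Treated_2=rnorm(5,5),
               Treated_3=rnorm(5,5))
    # Column positions of the two groups
    control_cols <- grep("Control",colnames(df))
    treated_cols <- grep("Treated",colnames(df))
    p_vals <- rep(NA_real_,nrow(df))
    # Welch two-sample t-test (the t.test() default) for each protein (row)
    for(i in seq_len(nrow(df))){
      # unlist() turns the one-row data frame into a plain numeric vector
      p_vals[i] <- t.test(unlist(df[i,control_cols]),
                          unlist(df[i,treated_cols]))$p.value
    }
    df <- cbind(df,Pvalue=p_vals)
    df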
    

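    This should give you something like the following (the exact values differ on each run because the data are random):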

      Protein Control_1 Control_2 Control_3 Treated_1 Treated_2 Treated_3    Pvalue
    1       1  5.813581  5.149145  4.662203  5.481839  6.424654  5.503664 0.2621811
    2       2  4.191440  6.155372  5.773128  3.941712  5.945056  4.182457 0.4769504
    3       3  4.654504  4.598808  5.258675  4.101895  6.135411  4.276641 0.9993112
    4       4  5.426672  4.520739  6.293757  3.787395  5.274740  3.847900 0.1909877
    5       5  5.614929  6.993289  3.786346  5.193352  5.362928  4.746676 0.7353676
    

    You can swap t.test() for another test, such as a non-parametric one like wilcox.test(), if you prefer.
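
The question also mentions correcting for multiple testing with p.adjust(), which the loop above does not cover. The following is a minimal sketch of that last step, using the example data from the question. The Benjamini-Hochberg method ("BH") is an assumption here, since the question does not say which correction is wanted, and the column names Pvalue and Padj are only illustrative.

```r
# Example data copied from the question
df <- data.frame(
  Protein   = 1:5,
  Control_1 = c(7.15, 6.64, 3.68, 5.04, 6.92),
  Control_2 = c(7.16, 6.61, 3.78, 5.01, 6.81),
  Control_3 = c(7.11, 6.59, 3.81, 4.69, 6.90),
  Treated_1 = c(6.91, 6.37, 2.40, 3.43, 7.12),
  Treated_2 = c(6.88, 6.35, 2.09, 3.52, 7.21),
  Treated_3 = c(6.92, 6.41, 2.17, 3.66, 7.27)
)

# Numeric matrices for each group, so that each row is a plain numeric vector
ctrl <- as.matrix(df[, grep("^Control", colnames(df))])
trt  <- as.matrix(df[, grep("^Treated", colnames(df))])

# One Welch t-test per protein (row), same test as in the loop version
df$Pvalue <- sapply(seq_len(nrow(df)),
                    function(i) t.test(ctrl[i, ], trt[i, ])$p.value)

# Adjust for multiple testing; "BH" is an assumed choice, see ?p.adjust for others
df$Padj <- p.adjust(df$Pvalue, method = "BH")

print(df)
```

The sapply() call does the same work as the for loop, so use whichever you find easier to read. To try a different correction, change the method argument, for example to "bonferroni" or "holm".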