Search code examples
rcorrelationp-value

Obtaining a matrix of p-values of a pearson correlation matrix


Dear StackOverflow Community:

I am trying to make a matrix of p-values that corresponds to a matrix I have obtained by obtaining correlation values

My data is the following (just doing 5 rows for simplicity, my real data is 3 columns for each data frame with 50 rows).

FG_Smooth <- data.frame(FS_1 = c(0.43, 0.33, 3.47, 5.26, 1.09), FS2 = c(0.01, 0.02, 6.86, 3.27, 0.86), FS_3 = c(0.07, 0.36, 1.91, 5.61, 0.84), row.names = c("Group_3", "Thermo", "Embryophyta", "Flavo", "Cyclo")) 
FMG_Smooth <- data.frame(GS_1 = c(1.13, 1.20, 0.52, 2.81, 0.70), GS_2 = c(1.18, 1.7, 0.42, 2.93, 0.78), GS_3 = c(1.17, 1.11, 0.60, 3.10, 0.87), row.names = c("Proline", "Trigonelline", "L-Lysine", "Nioctine", "Caffeate"))                         

library(Hmisc)
rcorr(t(FG_Smooth), t(FMG_Smooth), type = "pearson")

But I get this error:

Error in rcorr(t(FG_Smooth), t(FMG_Smooth), type = "pearson") : must have >4 observations

I only have 3 biological samples of each - so I am unable to use the rcorr command that has been suggested mulitple time in multiple posts. The rcorr command gives you 1) the matrix of correlations; and 2) p-values for the correlations.

So, to side-step this issue, I have ran the following: as suggested in other posts:

library(stats)
cor(t(FG_Smooth), t(FMG_Smooth), method = "pearson")

This works and gives a matrix of all of my correlations.

My next step is to find the p-values associated with each value in the correlation matrix. The function cor.test only gives an overall p-value, which isn't what I need.

After perusing multiple posts - I ran across this one: rcorr() function for correlations

I followed directions to code given:

tblcols <- expand.grid(1:ncol(FG_Smooth), 1:ncol(FMG_Smooth))
cfunc <- function(var1, var2) {
  cor.test(FG_Smooth[,var1], FMG_Smooth[,var2], method="pearson")
}

res <- mapply(function(a,b) {
  cfunc(var1 = a, var2 = b)
}, tblcols$Var1, tblcols$Var2)

head(res)
[,1]        [,2]        [,3]        [,4]        [,5]        [,6]        [,7]       
statistic   1.324125    -0.1022017  2.422883    0.9131595   -0.3509424  1.734178    1.53494    
parameter   3           3           3           3           3           3           3          
p.value     0.2773076   0.9250449   0.09392613  0.4284906   0.74883     0.1812997   0.2223626  
estimate    0.6073388   -0.05890371 0.8135079   0.4663678   -0.1985814  0.7075406   0.663238   
null.value  0           0           0           0           0           0           0          
alternative "two.sided" "two.sided" "two.sided" "two.sided" "two.sided" "two.sided" "two.sided"
            [,8]         [,9]       
statistic   -0.009291327 2.880821   
parameter   3            3          
p.value     0.99317      0.06348644 
estimate    -0.005364273 0.8570256  
null.value  0            0          
alternative "two.sided"  "two.sided"

This only gives me 9 p-values and not a matrix of p-values that corresponds to each correlation value obtained with the cor command. For this example, it would be a 5x5 matrix of p-values, since the cor command results in a 5x5 matrix of correlation values.

Is there a diff. way to do this?


Solution

  • Here's a tidyverse solution that creates all pairs of interest and then performs a cor.test for each pair and extracts the correlation value and the corresponding p value.

    # example data
    FG_Smooth <- data.frame(FS_1 = c(0.43, 0.33, 3.47, 5.26, 1.09), FS2 = c(0.01, 0.02, 6.86, 3.27, 0.86), FS_3 = c(0.07, 0.36, 1.91, 5.61, 0.84), row.names = c("Group_3", "Thermo", "Embryophyta", "Flavo", "Cyclo")) 
    FMG_Smooth <- data.frame(GS_1 = c(1.13, 1.20, 0.52, 2.81, 0.70), GS_2 = c(1.18, 1.7, 0.42, 2.93, 0.78), GS_3 = c(1.17, 1.11, 0.60, 3.10, 0.87), row.names = c("Proline", "Trigonelline", "L-Lysine", "Nioctine", "Caffeate"))                         
    
    library(tidyverse)
    
    expand.grid(v1 = row.names(FG_Smooth),                                # create combinations of names
                v2 = row.names(FMG_Smooth)) %>%
      tbl_df() %>%                                                        # use for visualisation purpose
      mutate(cor_test = map2(v1, v2, ~cor.test(unlist(FG_Smooth[.x,]),    # perform the correlation test for each pair and store it
                                               unlist(FMG_Smooth[.y,]))), 
             cor_value = map_dbl(cor_test, "estimate"),                   # get the correlation value from the test
             cor_p_value = map_dbl(cor_test, "p.value"))                  # get the p value from the test
    
    # # A tibble: 25 x 5
    #   v1          v2           cor_test    cor_value cor_p_value
    #   <fct>       <fct>        <list>          <dbl>       <dbl>
    # 1 Group_3     Proline      <S3: htest>    -0.998     0.0367 
    # 2 Thermo      Proline      <S3: htest>    -0.592     0.596  
    # 3 Embryophyta Proline      <S3: htest>     0.390     0.745  
    # 4 Flavo       Proline      <S3: htest>    -0.544     0.634  
    # 5 Cyclo       Proline      <S3: htest>    -0.966     0.167  
    # 6 Group_3     Trigonelline <S3: htest>    -0.492     0.673  
    # 7 Thermo      Trigonelline <S3: htest>    -0.998     0.0396 
    # 8 Embryophyta Trigonelline <S3: htest>     0.985     0.109  
    # 9 Flavo       Trigonelline <S3: htest>    -1.000     0.00188
    #10 Cyclo       Trigonelline <S3: htest>    -0.305     0.803  
    # # ... with 15 more rows
    

    v1 and v2 are the row names of your datasets that will create the pairs for the correlation tests, cor_test column has the correlation test object for each pair, cor_value has the extracted correlation coefficient and cor_p_value has the extracted p value.

    If you save the above output as a data frame you can easily reshape. For example if you save it as d you can get a 5x5 data frame of p values like this:

    d %>%
      select(v1, v2, cor_p_value) %>%
      spread(v2, cor_p_value)
    
    # # A tibble: 5 x 6
    #   v1          Caffeate `L-Lysine` Nioctine Proline Trigonelline
    #   <fct>          <dbl>      <dbl>    <dbl>   <dbl>        <dbl>
    # 1 Cyclo          0.309     0.995     0.351  0.167       0.803  
    # 2 Embryophyta    0.779     0.0931    0.737  0.745       0.109  
    # 3 Flavo          0.890     0.204     0.848  0.634       0.00188
    # 4 Group_3        0.439     0.875     0.481  0.0367      0.673  
    # 5 Thermo         0.928     0.242     0.886  0.596       0.0396 
    

    An alternative version using broom package as well would be:

    library(tidyverse)
    library(broom)
    
    expand.grid(v1 = row.names(FG_Smooth),                     
                v2 = row.names(FMG_Smooth)) %>%
      tbl_df() %>%
      mutate(cor_test = map2(v1, v2, ~tidy(cor.test(unlist(FG_Smooth[.x,]),    
                                                    unlist(FMG_Smooth[.y,]))))) %>%
      unnest()
    
    # # A tibble: 25 x 8
    #   v1          v2           estimate statistic p.value parameter method                               alternative
    #   <fct>       <fct>           <dbl>     <dbl>   <dbl>     <int> <chr>                                <chr>      
    # 1 Group_3     Proline        -0.998   -17.3   0.0367          1 Pearson's product-moment correlation two.sided  
    # 2 Thermo      Proline        -0.592    -0.735 0.596           1 Pearson's product-moment correlation two.sided  
    # 3 Embryophyta Proline         0.390     0.423 0.745           1 Pearson's product-moment correlation two.sided  
    # 4 Flavo       Proline        -0.544    -0.648 0.634           1 Pearson's product-moment correlation two.sided  
    # 5 Cyclo       Proline        -0.966    -3.73  0.167           1 Pearson's product-moment correlation two.sided  
    # 6 Group_3     Trigonelline   -0.492    -0.565 0.673           1 Pearson's product-moment correlation two.sided  
    # 7 Thermo      Trigonelline   -0.998   -16.0   0.0396          1 Pearson's product-moment correlation two.sided  
    # 8 Embryophyta Trigonelline    0.985     5.78  0.109           1 Pearson's product-moment correlation two.sided  
    # 9 Flavo       Trigonelline   -1.000  -339.    0.00188         1 Pearson's product-moment correlation two.sided  
    #10 Cyclo       Trigonelline   -0.305    -0.320 0.803           1 Pearson's product-moment correlation two.sided  
    # # ... with 15 more rows
    

    which gives you a tidy format of the correlation test object. You need to use columns estimate (correlation coefficient) and p.value.