Search code examples
rcorrelation

How can I complete a correlation in R of one variable across it's factor levels, matching by date


I am trying to determine the correlation between different subsets of a variable (Concentration, below) based on factor levels - in this case, Lake=(A,B,C) - in order words, test the correlation between measurements of concentration at A against measurements of concentration at B, and then B against C, and A against C.

The issue is that the subsets based on factor are of different length, but I only want to include observations in the correlation which have an exact date match. I tried use='complete.obs' in the cor.test function hoping that would do the trick but it did not work.

res <- cor.test(Data$Concentration[Data$Lake=="A"], 
            Data$Concentration[Data$Lake=="B"], 
            use='complete.obs', 
            method = "pearson")

but I get

Error in cor.test.default(Data$Concentration[Data$Lake=="A"],  : 
  'x' and 'y' must have the same length

Tried searching but could not find a solution. Is this something that might be able to be solved with melt/reshape or perhaps there a simpler solution I am not seeing. Thank you.

Data below...

structure(list(Lake = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", "B", 
"C"), class = "factor"), Date = structure(c(2L, 3L, 4L, 5L, 7L, 
8L, 9L, 1L, 3L, 4L, 6L, 7L, 2L, 3L, 4L, 6L, 7L), .Label = c("1970-04-06", 
"1970-04-07", "1970-04-28", "1970-05-04", "1970-05-14", "1970-05-15", 
"1970-05-28", "1970-05-29", "1970-05-30"), class = "factor"), 
    Concentration = c(10L, 20L, 30L, 40L, 50L, 50L, 50L, 100L, 
    200L, 280L, 410L, 500L, 1L, 3L, 8L, 90L, 1200L)), .Names = c("Lake", 
"Date", "Concentration"), class = "data.frame", row.names = c(NA, 
-17L))

Solution

  • If you just need the correlation, you can do something like:

    library(tidyr)
    data_wide = Data %>% pivot_wider(names_from="Lake",values_from="Concentration")
    data_wide
    
    # A tibble: 9 x 4
      Date           A     B     C
      <fct>      <int> <int> <int>
    1 1970-04-07    10    NA     1
    2 1970-04-28    20   200     3
    3 1970-05-04    30   280     8
    4 1970-05-14    40    NA    NA
    5 1970-05-28    50   500  1200
    6 1970-05-29    50    NA    NA
    7 1970-05-30    50    NA    NA
    8 1970-04-06    NA   100    NA
    9 1970-05-15    NA   410    90
    
    cor(data_wide[,-1],use="p")
              A         B         C
    A 1.0000000 0.9973327 0.8805841
    B 0.9973327 1.0000000 0.8014733
    C 0.8805841 0.8014733 1.0000000
    

    If you need correlation and p.values, like using cor.test, then a bit more coding:

    pw = combn(levels(Data$Lake),2)
    pw
         [,1] [,2] [,3]
    [1,] "A"  "A"  "B" 
    [2,] "B"  "C"  "C" 
    
    library(broom)
    library(dplyr)
    pairwise_c = apply(pw,2,function(i){
    tidy(cor.test(data_wide[[i[1]]],data_wide[[i[2]]])))
    })
    
    cbind(data.frame(t(pw)),bind_rows(pairwise_c))
    
      X1 X2  estimate statistic    p.value parameter
    1  A  B 0.9973327 13.663956 0.04650826         1
    2  A  C 0.8805841  2.627897 0.11941589         2
    3  B  C 0.8014733  1.895312 0.19852670         2
                                    method alternative   conf.low conf.high
    1 Pearson's product-moment correlation   two.sided         NA        NA
    2 Pearson's product-moment correlation   two.sided -0.5238283 0.9974832
    3 Pearson's product-moment correlation   two.sided -0.6948359 0.9956362