Search code examples
rcorrelationdata-management

Getting different cor() output on same data?


I have a problem when calculating correlations. I have two datasets: d3: One with 1259 observations and 264 variables. d4: One with 1196 observations and 185 variables - these are only blood tests. Both data have same unique ID for participants, so that they can be merged.

When merging (called: d) I have 1150 observations (because them without bloodtests are out, and the bloodtest data had many control samples and so on. When remowing participants without information x (since this is inclusion criteria) we end up with 1144 observations.

In d4 I had two samples with missing values in all the variables. They are included in the 1144 participants.

So after sum datamanagement I want to calculate correlations between informations x (we call it Edu) and all the blood samples (184).

I make a new dataset by subsetting from d:

dbio <- d %>%
  select(Edu, BMP_6:CCL16)

I use these codes: First two functions:

cor.prob <- function (X, dfr = nrow(X) - 2) {
  R <- cor(X, use="complete.obs", method = c("pearson"))
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr/(1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)
  R[row(R) == col(R)] <- NA
  R
}

flattenSquareMatrix <- function(m) {
  if( (class(m) != "matrix") | (nrow(m) != ncol(m))) stop("Must be a square matrix.") 
  if(!identical(rownames(m), colnames(m))) stop("Row and column names must be equal.")
  ut <- upper.tri(m)
  data.frame(i = rownames(m)[row(m)[ut]],
             j = rownames(m)[col(m)[ut]],
             cor=t(m)[ut],
             p=m[ut])
}

Then I start making correlations:

kor_masterlist_dbio <- flattenSquareMatrix (cor.prob(dbio))

kor_masterlist_dbio_ordnet <- kor_masterlist_dbio[order(-abs(kor_masterlist_dbio$cor)),]

kor_masterlist_dbio$threshold <- as.factor(kor_masterlist_dbio$p < 0.05)

kor_masterlist_dbio_Edu <- subset(kor_masterlist_dbio_ordnet, i == "Edu" & j != "ID")

kor_masterlist_dbio_Edu$threshold <- as.factor(kor_masterlist_dbio_Edu$p < 0.05)


kor_masterlist_dbio_Edu[kor_masterlist_dbio_Edu$threshold==T,]

kor_masterlist_dbio_Edu

Now the question: I use "complete.obs" in cor(). The two participants with all blood test missing(NA), if i take them out of the data by following code:

d <- d[rowSums(is.na(d[,3:6]))!=4,]
And end up with d: n = 1142.

I then get different results from the correlations. I end up with two less blood test as signifcant, as in p-value above 0.05. I dont understand the difference when the values are missing for those two participants.

When using "pariwise" ind cor() i get same results with 1142 or 1144 participants. But the last 4 blood tests are different from them I get with "complete.obs"

And the rest with slightly different correlationcoefficients and p-values. But stille same ranking.

I hope someone can help me. I should have same results wether the to participants are included or not.

I've tried to make following small reproducible example that shows my problem. You will get different results if you run it with/without:

d <- d[rowSums(is.na(d[,3:6]))!=4,]
ID <- c(1,2,3,4,5,6,7,8,9,10)
Edu <- c(4,7,9,13,15,18,11,10,12,8)
CLL <- c(NA,0.33,0.45,NA,0.76,0.34,0.49,0.86,0.43,0.26)
BLL <- c(NA,0.93,0.15,NA,0.86,0.14,0.29,0.36,0.93,0.76)
ALL <- c(NA,0.53,0.65,NA,0.26,0.54,0.99,0.76,0.63,NA)
DLL <- c(NA,0.13,0.95,NA,0.36,0.74,0.19,NA,0.83,0.56)

d <- data.frame(ID, Edu, CLL, BLL, ALL, DLL)
d <- d[rowSums(is.na(d[,3:6]))!=4,]

cor.prob <- function (X, dfr = nrow(X) - 2) {
  R <- cor(X, use="complete.obs", method = c("pearson"))
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr/(1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)
  R[row(R) == col(R)] <- NA
  R
}

flattenSquareMatrix <- function(m) {
  if( (class(m) != "matrix") | (nrow(m) != ncol(m))) stop("Must be a square matrix.") 
  if(!identical(rownames(m), colnames(m))) stop("Row and column names must be equal.")
  ut <- upper.tri(m)
  data.frame(i = rownames(m)[row(m)[ut]],
             j = rownames(m)[col(m)[ut]],
             cor=t(m)[ut],
             p=m[ut])
}

kor_masterlist_d <- flattenSquareMatrix (cor.prob(d))
kor_masterlist_d_order <- kor_masterlist_d[order(-abs(kor_masterlist_d$cor)),]
kor_masterlist_d_Edu <- subset(kor_masterlist_d_order, i == "Edu" & j != "ID")
kor_masterlist_d_Edu
#>      i   j        cor         p
#> 8  Edu ALL -0.3319602 0.4217894
#> 3  Edu CLL  0.2646661 0.5264383
#> 12 Edu DLL  0.2609108 0.5325405
#> 5  Edu BLL -0.2492912 0.5515835

Created on 2021-07-09 by the reprex package (v2.0.0)

ID <- c(1,2,3,4,5,6,7,8,9,10)
Edu <- c(4,7,9,13,15,18,11,10,12,8)
CLL <- c(NA,0.33,0.45,NA,0.76,0.34,0.49,0.86,0.43,0.26)
BLL <- c(NA,0.93,0.15,NA,0.86,0.14,0.29,0.36,0.93,0.76)
ALL <- c(NA,0.53,0.65,NA,0.26,0.54,0.99,0.76,0.63,NA)
DLL <- c(NA,0.13,0.95,NA,0.36,0.74,0.19,NA,0.83,0.56)

d <- data.frame(ID, Edu, CLL, BLL, ALL, DLL)

cor.prob <- function (X, dfr = nrow(X) - 2) {
  R <- cor(X, use="complete.obs", method = c("pearson"))
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr/(1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)
  R[row(R) == col(R)] <- NA
  R
}

flattenSquareMatrix <- function(m) {
  if( (class(m) != "matrix") | (nrow(m) != ncol(m))) stop("Must be a square matrix.") 
  if(!identical(rownames(m), colnames(m))) stop("Row and column names must be equal.")
  ut <- upper.tri(m)
  data.frame(i = rownames(m)[row(m)[ut]],
             j = rownames(m)[col(m)[ut]],
             cor=t(m)[ut],
             p=m[ut])
}

kor_masterlist_d <- flattenSquareMatrix (cor.prob(d))
kor_masterlist_d_order <- kor_masterlist_d[order(-abs(kor_masterlist_d$cor)),]
kor_masterlist_d_Edu <- subset(kor_masterlist_d_order, i == "Edu" & j != "ID")
kor_masterlist_d_Edu
#>      i   j        cor         p
#> 8  Edu ALL -0.3319602 0.3487063
#> 3  Edu CLL  0.2646661 0.4599218
#> 12 Edu DLL  0.2609108 0.4665494
#> 5  Edu BLL -0.2492912 0.4873203

Created on 2021-07-09 by the reprex package (v2.0.0)


Solution

  • The difference is introduced with

      Fstat <- r2 * dfr/(1 - r2)
    

    due to different values of dfr (8 and 6).

    > r2
     [1] 0.234375000 0.011456074 0.070048129 0.002401998 0.062146106 0.062751663
     [7] 0.096834764 0.110197604 0.165400138 0.216547595 0.057255529 0.068074453
    [13] 0.030140179 0.136955494 0.005027238
    > r2*8/(1-r2)
     [1] 2.44897959 0.09271069 0.60259574 0.01926226 0.53011332 0.53562464
     [7] 0.85773686 0.99076023 1.58543173 2.21121379 0.48586255 0.58437675
    [13] 0.24861473 1.26951037 0.04042111
    > r2*6/(1-r2)
     [1] 1.83673469 0.06953302 0.45194680 0.01444669 0.39758499 0.40171848
     [7] 0.64330264 0.74307017 1.18907380 1.65841034 0.36439691 0.43828256
    [13] 0.18646105 0.95213278 0.03031583
    

    Neither of 8 or 6 is correct, since it's based on nrow(X), while for use="complete.obs" it has to be based on the number of complete observations. This can be accomplished by changing the function definition to cor.prob <- function (X, dfr = sum(complete.cases(X)) - 2) { …. Therewith, the same results are produced with and without the prior
    d <- d[rowSums(is.na(d[,3:6]))!=4,]

    But if I choose to use pairwise.complete.obs instead of complete.obs, then i'll keep the code as it was? And further, since i have NA values different places in my data then i will benefite from "pairwise" rather than "complete.obs"?

    Indeed if we use pairwise.complete.obs, we gain the observations where only part of the columns are NA. But, since we then have different numbers of observations for the individual columns, a single dfr value is not appropriate; instead, we can use a dfr matrix:

    library(psych)
    cor.prob <- function (X, dfr = pairwiseCount(X) - 2)
    {
      …
      Fstat <- r2 * dfr[above]/(1 - r2)
      R[above] <- 1 - pf(Fstat, 1, dfr[above])
      …