How do I ignore double zeros before calculating the correlation between two data frames in R?


I have two data frames with the same number of columns (100 samples) and rows (9600 genes). They are the output of two different programs, and I would like to calculate the correlation between them.

My example datasets:

df1 <- data.frame(Sample1 = c(0.52, 2.5, 8.3, 10.5, 5.3), Sample2 = c(0, 0, 2, 1, 0), Sample3 = c(0, 12, 13, 14, 0))
rownames(df1) <- c("KO1", "KO2", "KO3", "KO4", "KO5")
df2 <- data.frame(Sample1 = c(1, 2, 3, 4, 5), Sample2 = c(0, 0, 8, 9, 0), Sample3 = c(0, 12, 13, 14, 0))
rownames(df2) <- c("KO1", "KO2", "KO3", "KO4", "KO5")
df <- data.frame(df1, df2)

>df1
      Sample1 Sample2 Sample3
KO1    0.52       0       0
KO2    2.50       0      12
KO3    8.30       2      13
KO4   10.50       1      14
KO5    5.30       0       0

>df2
      Sample1 Sample2 Sample3
KO1       1       0       0
KO2       2       0      12
KO3       3       8      13
KO4       4       9      14
KO5       5       0       0

When calculating the correlation, I want to exclude entries that are zero in both data frames. For example, for Sample1 every row should be included, but for Sample2, KO1, KO2 and KO5 should be excluded; likewise, for Sample3, KO1 and KO5 should be excluded. I am calculating the column-wise correlation between the two data frames.
I tried the following code:

output_without_zero<- with(subset(df, !(df1 == 0 & df2 == 0)), cor(df1,df2,method = "spearman"))
output_with_zero<- cor(df1,df2,method = "spearman")

I expected the correlation without the double zeros to differ from the one that includes them, but I got the same correlation matrix for both. How can I get the desired output?

Thank you in advance


Solution

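  • Replace zeros with NA

    Note that this marks every zero as missing, not only the positions where both data frames are zero. In the example data the two give the same result for matching samples, because the zeros of Sample2 and Sample3 fall on the same rows in df1 and df2.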

    df1[df1 == 0] <- NA
    df2[df2 == 0] <- NA
    

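    With use = "complete.obs", every row that contains an NA in any column of either data frame is dropped. Here only KO3 and KO4 remain, so each correlation is computed from two observations: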

    cor(df1, df2, method = "spearman",  use = "complete.obs")
            Sample1 Sample2 Sample3
    Sample1       1       1       1
    Sample2      -1      -1      -1
    Sample3       1       1       1
    

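    With use = "pairwise.complete.obs", rows are dropped separately for each pair of columns, only where one of the two values is NA. The diagonal holds the column-wise (same-sample) correlations: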

    cor(df1, df2, method = "spearman",  use = "pairwise.complete.obs")
            Sample1 Sample2 Sample3
    Sample1     0.7       1       1
    Sample2    -1.0      -1      -1
    Sample3     1.0       1       1
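
If only the double zeros should be ignored (a zero in one data frame paired with a non-zero value in the other is kept), the filtering can be done per sample instead of through NA. The following is a minimal sketch under that assumption, not the only way to do it. It uses the original data frames from the question, with the zeros still in place, and `cor_no_double_zero` is a hypothetical helper name introduced here for illustration. It also returns just the column-wise correlations the question asks for rather than the full matrix. As a side note, in the attempt from the question `cor(df1, df2)` inside `with()` still sees the full global `df1` and `df2`, which is why both results were identical.

```r
# Example data from the question (zeros NOT replaced by NA)
df1 <- data.frame(Sample1 = c(0.52, 2.5, 8.3, 10.5, 5.3),
                  Sample2 = c(0, 0, 2, 1, 0),
                  Sample3 = c(0, 12, 13, 14, 0),
                  row.names = c("KO1", "KO2", "KO3", "KO4", "KO5"))
df2 <- data.frame(Sample1 = c(1, 2, 3, 4, 5),
                  Sample2 = c(0, 0, 8, 9, 0),
                  Sample3 = c(0, 12, 13, 14, 0),
                  row.names = c("KO1", "KO2", "KO3", "KO4", "KO5"))

# Hypothetical helper: correlation for one pair of columns,
# dropping only the rows where BOTH values are zero
cor_no_double_zero <- function(x, y, method = "spearman") {
  keep <- !(x == 0 & y == 0)
  cor(x[keep], y[keep], method = method)
}

# mapply() walks over the columns of both data frames in parallel:
# Sample1 vs Sample1, Sample2 vs Sample2, Sample3 vs Sample3
per_sample <- mapply(cor_no_double_zero, df1, df2)
per_sample
```

For the example data this gives 0.7, -1 and 1 for Sample1, Sample2 and Sample3, matching the diagonal of the pairwise.complete.obs matrix. The two approaches would differ on data where a zero in one data frame lines up with a non-zero value in the other.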