r, dataframe, venn-diagram

R - Conditionally summarize data from all possible column pairs


I have a table that lists the presence/absence of each organism across several different conditions. My goal is to generate a new table that lists the Venn diagram counts for every possible pair of organisms.

...put another way: for each pair of organisms, I want a table summarizing:

  1. the number of conditions that they share (organism1 == 1 & organism2 == 1)
  2. the number of conditions unique to organism1 (organism1 == 1 & organism2 == 0)
  3. the number of conditions unique to organism2 (organism1 == 0 & organism2 == 1)
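
A minimal sketch of these three counts for a single pair, assuming two made-up 0/1 vectors (`org1` and `org2` are hypothetical names, not columns from the table below):

```r
# Hypothetical presence/absence vectors for two organisms (1 = present)
org1 <- c(1, 1, 0, 0, 1)
org2 <- c(0, 1, 0, 1, 1)

shared    <- sum(org1 == 1 & org2 == 1)  # conditions where both are present
org1_only <- sum(org1 == 1 & org2 == 0)  # conditions unique to org1
org2_only <- sum(org1 == 0 & org2 == 1)  # conditions unique to org2

c(shared = shared, org1_only = org1_only, org2_only = org2_only)
```

Each comparison yields a logical vector, and `sum()` counts its `TRUE` entries, so no intermediate label column is strictly needed to get the counts.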

My current method is below, but my real presence/absence table is much larger, so it would be great to have a more concise way to automate this (e.g. a for-loop?).

Example Presence/Absence Table (rows=conditions, columns=organisms):

library(data.table)

paData <- data.table(
  Pyro = c(1,1,0,0,1,0,1),
  Anth = c(0,1,0,1,0,1,1),
  Tric = c(1,1,0,1,0,1,1))
 
paData
   Pyro Anth Tric
1:    1    0    1
2:    1    1    1
3:    0    0    0
4:    0    1    1
5:    1    0    0
6:    0    1    1
7:    1    1    1

For each pair of organisms (columns) designate whether one, both, or neither organism was present in each condition (row):

paData$PyroAnth <- ifelse(paData[,1] ==1 & 
                            paData[,2] ==0, "V1alone",
                        ifelse(paData[,1] ==1 & 
                                 paData[,2] ==1, "Overlap",
                               ifelse(paData[,1] ==0 & 
                                        paData[,2] ==1, "V2alone", 
                                            "NA")))

paData$PyroTric <- ifelse(paData[,1] ==1 & 
                           paData[,3] ==0, "V1alone",
                       ifelse(paData[,1] ==1 & 
                                paData[,3] ==1, "Overlap",
                              ifelse(paData[,1] ==0 & 
                                       paData[,3] ==1, "V2alone", 
                                     "NA")))

paData$AnthTric <- ifelse(paData[,2] ==1 & 
                           paData[,3] ==0, "V1alone",
                         ifelse(paData[,2] ==1 & 
                                  paData[,3] ==1, "Overlap",
                                ifelse(paData[,2] ==0 & 
                                         paData[,3] ==1, "V2alone", 
                                       "NA")))

paData
   Pyro Anth Tric PyroAnth PyroTric AnthTric
1:    1    0    1  V1alone  Overlap  V2alone
2:    1    1    1  Overlap  Overlap  Overlap
3:    0    0    0       NA       NA       NA
4:    0    1    1  V2alone  V2alone  Overlap
5:    1    0    0  V1alone  V1alone       NA
6:    0    1    1  V2alone  V2alone  Overlap
7:    1    1    1  Overlap  Overlap  Overlap
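
The three nested `ifelse` blocks above repeat the same logic for every pair. One way this could be factored out is sketched here, assuming a hypothetical helper `classify_pair` and a plain data.frame `pa` standing in for the data.table. Note that this sketch uses a real `NA` rather than the string "NA" for rows where neither organism is present:

```r
# Same example data, as a base R data.frame
pa <- data.frame(
  Pyro = c(1, 1, 0, 0, 1, 0, 1),
  Anth = c(0, 1, 0, 1, 0, 1, 1),
  Tric = c(1, 1, 0, 1, 0, 1, 1)
)

# Hypothetical helper: label each row for one pair of 0/1 vectors
classify_pair <- function(v1, v2) {
  ifelse(v1 == 1 & v2 == 1, "Overlap",
  ifelse(v1 == 1 & v2 == 0, "V1alone",
  ifelse(v1 == 0 & v2 == 1, "V2alone", NA_character_)))
}

# Loop over every unordered pair of organism columns
organisms <- names(pa)
for (p in combn(organisms, 2, simplify = FALSE)) {
  pa[[paste0(p[1], p[2])]] <- classify_pair(pa[[p[1]]], pa[[p[2]]])
}
```

The pair list is built from `organisms` before any label columns are added, so the new columns are never paired with each other.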

Create the desired output table: for each pair of organisms, count the number of conditions (rows) in which each organism was present alone ("V1alone", "V2alone") and in which both were present ("Overlap").

DesiredOutput <- data.frame(rbind(list(names(paData[,1]), names(paData[,2]),
                                       nrow(paData[PyroAnth == "V1alone"]),
                                       nrow(paData[PyroAnth == "Overlap"]),
                                       nrow(paData[PyroAnth == "V2alone"])),
                                  list(names(paData[,1]), names(paData[,3]),
                                       nrow(paData[PyroTric == "V1alone"]),
                                       nrow(paData[PyroTric == "Overlap"]),
                                       nrow(paData[PyroTric == "V2alone"])),
                                  list(names(paData[,2]), names(paData[,3]),
                                       nrow(paData[AnthTric == "V1alone"]),
                                       nrow(paData[AnthTric == "Overlap"]),
                                       nrow(paData[AnthTric == "V2alone"]))))

colnames(DesiredOutput) <- c("V1", "V2", "V1alone", "Overlap", "V2alone")

DesiredOutput
    V1   V2 V1alone Overlap V2alone
1 Pyro Anth       2       2       2
2 Pyro Tric       1       3       2
3 Anth Tric       0       4       1

How could this be automated to efficiently create my "DesiredOutput" table for dozens of organisms and hundreds of conditions?


Solution

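  • You could try this approach. Run it on the original three-column paData, before the PyroAnth, PyroTric, and AnthTric helper columns are added, because combn() pairs up every column name:

    # Counts for one pair of 0/1 columns: v1 only, both, v2 only
    f <- function(v1, v2) list(sum(v1 & !v2), sum(v1 & v2), sum(!v1 & v2))

    # One row per unordered pair of column names (columns V1 and V2)
    result <- data.table(t(combn(names(paData), 2)))

    # Look up each pair's columns by name and fill in the three counts row by row
    result[, c("v1alone", "overlap", "v2alone") := f(paData[[V1]], paData[[V2]]), by = 1:nrow(result)]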
    

    Output:

         V1   V2 v1alone overlap v2alone
    1: Pyro Anth       2       2       2
    2: Pyro Tric       1       3       2
    3: Anth Tric       0       4       1
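
As an alternative sketch (not part of the approach above), the same counts can also be derived with matrix algebra in base R, which avoids the row-by-row grouping and may be convenient with dozens of organisms. The names `m`, `shared`, `totals`, and `pair_counts` are illustrative, and the data are the example from the question:

```r
# Example data from the question, as a plain 0/1 matrix
m <- cbind(
  Pyro = c(1, 1, 0, 0, 1, 0, 1),
  Anth = c(0, 1, 0, 1, 0, 1, 1),
  Tric = c(1, 1, 0, 1, 0, 1, 1)
)

shared <- crossprod(m)              # shared[i, j] = conditions with both i and j present
totals <- colSums(m)                # conditions in which each organism is present
pairs  <- t(combn(colnames(m), 2))  # one row per unordered pair of organisms

# Index the shared-count matrix by (row name, column name) pairs
overlap <- shared[pairs]

pair_counts <- data.frame(
  V1 = pairs[, 1],
  V2 = pairs[, 2],
  v1alone = unname(totals[pairs[, 1]]) - overlap,  # present in V1 but not V2
  overlap = overlap,
  v2alone = unname(totals[pairs[, 2]]) - overlap   # present in V2 but not V1
)
```

For a 0/1 matrix, `crossprod(m)` computes `t(m) %*% m`, so each off-diagonal entry is the number of rows where both columns equal 1. The "alone" counts follow by subtracting that overlap from each organism's total.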