Search code examples
rmatrixequationdata-wrangling

Co-occurrence calculations from a presence-absence matrix


I have this great presence-absence dataset in which I need to calculate a C score (CS) for each species pair (BABO, BW, RC, SKS, MANG) to measure species co-occurrences.

C score equation

ki and kj denote the numbers of occurrences of species i and j and K is the number of co-occurrences of both species.

I have looked at articles like Find the pair of most correlated variables and Turning co-occurrence counts into co-occurrence probabilities with cascalog but was unable to determine the most efficient way to go about it in R. I have tried creating a function but was unsuccessful.

My data:

data <- structure(list(group_id = c("2008-2-11.C3_900", "2008-2-11.C3_960", 
"2008-2-11.C3_1200", "2008-2-11.C3_1230", "2008-2-11.C3_1460", 
"2008-2-11.C3_1490", "2008-2-22.Mwani_0", "2008-2-22.Mwani_110", 
"2008-2-22.Mwani_600", "2008-2-22.Mwani_1650", "2008-2-20.Sanje_150", 
"2008-2-20.Sanje_410", "2008-2-20.Sanje_3000", "2008-5-9.C3_900", 
"2008-5-13.Mwani_750", "2008-5-13.Mwani_800", "2008-5-13.Mwani_900", 
"2008-5-13.Mwani_1080", "2008-5-13.Mwani_1800", "2008-5-13.Mwani_2200", 
"2008-5-13.Mwani_2900"), BABO = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0), BW = c(1, 1, 1, 1, 0, 0, 
0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1), RC = c(0, 0, 1, 
1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0), SKS = c(0, 
1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0), 
    MANG = c(0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 1)), row.names = c(NA, -21L), class = c("tbl_df", 
"tbl", "data.frame"))
group_id               BABO BW RC SKS MANG
<chr>                  <dbl><dbl><dbl><dbl>
2008-2-11.C3_900        0   1   0   0   0
2008-2-11.C3_960        0   1   0   1   0
2008-2-11.C3_1200       0   1   1   1   0
2008-2-11.C3_1230       0   1   1   0   0
2008-2-11.C3_1460       0   0   1   0   0
2008-2-11.C3_1490       0   0   0   1   0
2008-2-22.Mwani_0       0   0   0   1   1
2008-2-22.Mwani_110     0   0   1   0   1
2008-2-22.Mwani_600     0   1   0   0   0
2008-2-22.Mwani_1650    0   0   0   1   0
2008-2-20.Sanje_150     0   1   0   1   0
2008-2-20.Sanje_410     0   0   1   0   0
2008-2-20.Sanje_3000    0   0   1   0   0
2008-5-9.C3_900         0   0   0   1   0
2008-5-13.Mwani_750     0   0   0   1   0
2008-5-13.Mwani_800     1   0   1   0   0
2008-5-13.Mwani_900     0   0   1   0   0
2008-5-13.Mwani_1080    0   1   1   0   0
2008-5-13.Mwani_1800    0   1   0   0   0
2008-5-13.Mwani_2200    0   1   0   0   0
2008-5-13.Mwani_2900    0   1   0   0   1

Solution

  • First define a function to compute CS for a given pair of species; then use combn() to generate all possible pairs; then pass each pair to the CS() function using purrr::map2_dbl().

    library(dplyr)
    library(purrr)
    
    CS <- function(i, j, data) {
      K <- sum(data[[i]] == 1 & data[[j]] == 1)
      ki <- sum(data[[i]])
      kj <- sum(data[[j]])
      ((ki - K) * (kj - K)) / (ki * kj)
    }
    
    names(sp_data)[-1] %>%
      combn(2) %>%
      t() %>%
      as_tibble() %>%
      set_names(c("i", "j")) %>%
      mutate(CS = map2_dbl(i, j, CS, data = sp_data))
    
    # A tibble: 10 × 3
       i     j        CS
       <chr> <chr> <dbl>
     1 BABO  BW    1    
     2 BABO  RC    0    
     3 BABO  SKS   1    
     4 BABO  MANG  1    
     5 BW    RC    0.467
     6 BW    SKS   0.438
     7 BW    MANG  0.6  
     8 RC    SKS   0.778
     9 RC    MANG  0.593
    10 SKS   MANG  0.583
    

    (Side note: using the equation you provided, species with little co-occurrence end up with high CS scores, and vice versa. If this isn’t what you expected, perhaps your equation should be subtracted from 1?)