Search code examples
perlrperl-data-structures

calculate co-occurrences


screenshot

I have a file as shown in the screenshot attached. There are 61 events(peaks) and I want to find how often each peak occurs with the other (co-occurrence) for all possible combinations. The file has the frequency (number of times the peak appears in the 47 samples) and probability (no.of times the peak occurs divided by total number of samples).

Then I want to find mutually exclusive peaks using the formula p(x,y) / p(x)*p(y), where p(x,y) is the probability that x and y co-occur, p(x) is probability of peak (x) and p(y) is the probability of peak y.

What is the best way to solve such a problem? Do I need to write a Perl script or are there some R functions I could use? I am a biologist trying to learn Perl and R so I would appreciate some example code to solve this problem.


Solution

  • In the following, I've assumed that what you alternately call p(xy) and p(x,y) should actually be the probability (rather than the number of times) that x and y co-occur. If that's not correct, just remove the division by nrow(X) from the 2nd line below.

    # As an example, create a sub-matrix of your data
    X <- cbind(c(0,0,0,0,0,0), c(1,0,0,1,1,1), c(1,1,0,0,0,0))
    
    num <- (t(X) %*% X)/nrow(X)              # The numerator of your expression   
    means <- colMeans(X)                     # A vector of means of each column
    denom <- outer(colMeans(X), colMeans(X)) # The denominator
    out <- num/denom
    #      [,1] [,2] [,3]
    # [1,]  NaN  NaN  NaN
    # [2,]  NaN 1.50 0.75
    # [3,]  NaN 0.75 3.00
    

    Note: The NaNs in the results are R's way of indicating that those cells are "Not a number" (since they are each the result of dividing 0 by 0).