R: Frequency of all column combinations


Problem description

I have a character vector of strings of equal length, like this:

example.list <- c('BBCD','ABBC','ADDB','ACBB')

I want to obtain the frequency of occurrence of specific letters at specific positions. First I convert this to an indicator matrix, with one row per string and one column per letter/position combination:

     A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 A4 B4 C4 D4
[1,]  0  1  0  0  0  1  0  0  0  0  1  0  0  0  0  1
[2,]  1  0  0  0  0  1  0  0  0  1  0  0  0  0  1  0
[3,]  1  0  0  0  0  0  0  1  0  0  0  1  0  1  0  0
[4,]  1  0  0  0  0  0  1  0  0  1  0  0  0  1  0  0
[5,]  1  0  0  0  0  1  0  0  0  1  0  0  0  0  0  1
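
The question does not show how the matrix was built. A minimal sketch of one way to do it follows; the names chars, letters.used, labels and onehot are illustrative and not from the original post. It is applied to the four strings in example.list, so it produces four rows, which should correspond to rows 1 to 4 of the matrix above.

```r
example.list <- c('BBCD', 'ABBC', 'ADDB', 'ACBB')

chars <- strsplit(example.list, "")           # list of single-character vectors
n <- lengths(chars)[1]                        # common string length
letters.used <- sort(unique(unlist(chars)))   # letters that occur anywhere

# Column labels in the order A1 B1 C1 D1 A2 B2 ...
labels <- paste0(rep(letters.used, times = n),
                 rep(seq_len(n), each = length(letters.used)))

# One row per string: 1 where the label matches a letter at that position
onehot <- t(sapply(chars, function(ch) {
  as.integer(labels %in% paste0(ch, seq_len(n)))
}))
colnames(onehot) <- labels
onehot
```

Each row sums to n, because every string contributes exactly one letter per position.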

Now I want to obtain the frequency of each column combination. Some examples:

A1 : B2 = 2
A1 : B3 = 3
B1 : B2 = 1
.. etc
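
To make the meaning of these counts concrete, here is a small sketch that counts a single pair directly from the strings. The helper pair.count is hypothetical, and the fifth string 'ABBD' is an assumption: it is read back from row 5 of the matrix above and does not appear in example.list.

```r
# The five rows of the matrix above, read back as strings.
# 'ABBD' is inferred from row 5 (assumption; example.list has only four strings).
strings <- c('BBCD', 'ABBC', 'ADDB', 'ACBB', 'ABBD')

# Hypothetical helper: number of strings that have letter l1 at position p1
# and letter l2 at position p2.
pair.count <- function(x, l1, p1, l2, p2) {
  sum(substr(x, p1, p1) == l1 & substr(x, p2, p2) == l2)
}

pair.count(strings, "A", 1, "B", 2)  # 2
pair.count(strings, "A", 1, "B", 3)  # 3
pair.count(strings, "B", 1, "B", 2)  # 1
```

With these five strings the results agree with the example counts listed above.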

Solution

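  • Split the strings into a list, s, of vectors of single characters, and set n to their common length. Build a matrix v from s with one column per string, whose elements are letter/position labels such as B1. Then use xtabs to tabulate the labels by string, giving the 0/1 table m1 (only labels that actually occur get a column), and use combn to count, for every pair of columns of m1, the number of strings in which both labels occur, giving m2.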

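    s <- strsplit(example.list, "")   # list of single-character vectors
    n <- lengths(s)[1]                # common string length
    v <- sapply(s, paste0, 1:n)       # one column per string, labels such as "B1"
    # rows of m1 are strings, columns are the labels that occur
    m1 <- xtabs(~., data.frame(colv = c(col(v)), v = c(v)))
    # for each pair of columns, count the strings that contain both labels
    m2 <- combn(1:ncol(m1), 2, function(ix) sum(m1[, ix[1]] * m1[, ix[2]]))
    names(m2) <- combn(colnames(m1), 2, paste, collapse = "")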
    

    giving:

    > m1
        v
    colv A1 B1 B2 B3 B4 C2 C3 C4 D2 D3 D4
       1  0  1  1  0  0  0  1  0  0  0  1
       2  1  0  1  1  0  0  0  1  0  0  0
       3  1  0  0  0  1  0  0  0  1  1  0
       4  1  0  0  1  1  1  0  0  0  0  0
    
    > m2
    A1B1 A1B2 A1B3 A1B4 A1C2 A1C3 A1C4 A1D2 A1D3 A1D4 B1B2 B1B3 B1B4 B1C2 B1C3 B1C4 
       0    1    2    2    1    0    1    1    1    0    1    0    0    0    1    0 
    B1D2 B1D3 B1D4 B2B3 B2B4 B2C2 B2C3 B2C4 B2D2 B2D3 B2D4 B3B4 B3C2 B3C3 B3C4 B3D2 
       0    0    1    1    0    0    1    1    0    0    1    1    1    0    1    0 
    B3D3 B3D4 B4C2 B4C3 B4C4 B4D2 B4D3 B4D4 C2C3 C2C4 C2D2 C2D3 C2D4 C3C4 C3D2 C3D3 
       0    0    1    0    0    1    1    0    0    0    0    0    0    0    0    0 
    C3D4 C4D2 C4D3 C4D4 D2D3 D2D4 D3D4 
       1    0    0    0    1    0    0
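
As a possible alternative to the combn step, the same pair counts can be obtained in one call as a matrix product, since m1 contains only 0s and 1s. This is a sketch, not part of the original answer; the name co is illustrative.

```r
example.list <- c('BBCD', 'ABBC', 'ADDB', 'ACBB')

# m1 and m2 as in the solution above
s <- strsplit(example.list, "")
n <- lengths(s)[1]
v <- sapply(s, paste0, 1:n)
m1 <- xtabs(~., data.frame(colv = c(col(v)), v = c(v)))
m2 <- combn(1:ncol(m1), 2, function(ix) sum(m1[, ix[1]] * m1[, ix[2]]))
names(m2) <- combn(colnames(m1), 2, paste, collapse = "")

# crossprod(m1) is t(m1) %*% m1: entry [i, j] is the number of strings
# that contain both label i and label j
co <- crossprod(m1)

co["A1", "B3"]                  # same count as m2["A1B3"]
all(co[lower.tri(co)] == m2)    # lower triangle, read column by column,
                                # follows the pair order used by combn
```

The diagonal of co holds the frequency of each single label, and the result is indexed by two label names rather than by a pasted name such as A1B3.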