R - Find all sequences and their frequencies in a data frame


I have this data.frame:

10  34  35  39  55  43
10  32  33  40  45  48
10  35  36  38  41  43
30  31  32  34  36  49
39  55  40  43  45  50
30  32  35  36  49  50
 2   8   9  39  55  43
 1   2   8  12  55  43
 2   8  12  55  43  61
 2   8  55  43  61  78

I'd like to find all sequences of length greater than 2 across all rows and group them by frequency, keeping only those with frequency greater than 1. In this case, the result should show:

sequence               frequency
[39  55  43]           3
[10  35  43]           2
[32  36  49]           2
[30  32  36]           2
[30  32  36  49]       2
[ 2   8  55]           4
[ 2   8  55  43]       4
[ 2   8  55  43  61]   2

Is it possible to do this in R?


Solution

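  • You can write a function subseqs that enumerates every subsequence of length 3 or more in each row (elements keep their original order but need not be adjacent), then tabulate the frequencies with table and keep those that occur at least twice: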

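    # all order-preserving subsequences of length 3..length(v), each as one string
    subseqs <- function(v) sapply(3:length(v), function(k) combn(v, k, FUN = toString))
    
    # count each subsequence across all rows
    f <- table(unlist(apply(df, 1, subseqs)), dnn = "sequence")
    
    # keep only subsequences that occur at least twice
    dfout <- data.frame(f[f >= 2])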
    

    which gives

    > dfout
               sequence Freq
    1        10, 35, 43    2
    2        12, 55, 43    2
    3         2, 12, 43    2
    4         2, 12, 55    2
    5     2, 12, 55, 43    2
    6         2, 43, 61    2
    7         2, 55, 43    4
    8     2, 55, 43, 61    2
    9         2, 55, 61    2
    10         2, 8, 12    2
    11     2, 8, 12, 43    2
    12     2, 8, 12, 55    2
    13 2, 8, 12, 55, 43    2
    14         2, 8, 43    4
    15     2, 8, 43, 61    2
    16         2, 8, 55    4
    17     2, 8, 55, 43    4
    18 2, 8, 55, 43, 61    2
    19     2, 8, 55, 61    2
    20         2, 8, 61    2
    21       30, 32, 36    2
    22   30, 32, 36, 49    2
    23       30, 32, 49    2
    24       30, 36, 49    2
    25       32, 36, 49    2
    26       39, 55, 43    3
    27       55, 43, 61    2
    28        8, 12, 43    2
    29        8, 12, 55    2
    30    8, 12, 55, 43    2
    31        8, 43, 61    2
    32        8, 55, 43    4
    33    8, 55, 43, 61    2
    34        8, 55, 61    2
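
To see where entries such as `2, 8, 43` come from, it may help to run `combn` on a single row. The sketch below uses the last row of the example data; the names `v`, `triples`, and `n_per_row` are only illustrative and are not part of the answer above. `combn` picks elements by position without reordering them, so each string is a subsequence of the row, though not necessarily a contiguous one.

```r
# Last row of the example data frame.
v <- c(2L, 8L, 55L, 43L, 61L, 78L)

# Every way to pick 3 elements while keeping their original order,
# each pick collapsed into one comma-separated string.
triples <- combn(v, 3, FUN = toString)
print(triples[1:4])
# first four: "2, 8, 55" "2, 8, 43" "2, 8, 61" "2, 8, 78"

# Number of subsequences of length 3 to 6 for a row of 6 values:
# choose(6, 3) + choose(6, 4) + choose(6, 5) + choose(6, 6)
n_per_row <- sum(choose(length(v), 3:length(v)))
print(n_per_row)
# 42
```

Each 6-value row contributes 42 strings, and the count grows roughly like 2^n with row length n, so this enumeration is best suited to narrow data frames.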
    

    DATA

    df <- structure(list(V1 = c(10L, 10L, 10L, 30L, 39L, 30L, 2L, 1L, 2L, 
    2L), V2 = c(34L, 32L, 35L, 31L, 55L, 32L, 8L, 2L, 8L, 8L), V3 = c(35L, 
    33L, 36L, 32L, 40L, 35L, 9L, 8L, 12L, 55L), V4 = c(39L, 40L, 
    38L, 34L, 43L, 36L, 39L, 12L, 55L, 43L), V5 = c(55L, 45L, 41L, 
    36L, 45L, 49L, 55L, 55L, 43L, 61L), V6 = c(43L, 48L, 43L, 49L, 
    50L, 50L, 43L, 43L, 61L, 78L)), class = "data.frame", row.names = c(NA, 
    -10L))
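
The result above is ordered alphabetically by the sequence string. If a ranking by frequency is more useful, one possible variation is sketched below. It is a self-contained rerun of the same idea, with a few assumptions of mine: the data is read with `read.table(text = ...)` instead of `structure()`, `lapply` plus `unlist` replaces `sapply`, and the intermediate name `counts` is hypothetical.

```r
# Example data from the question, read from text (columns become V1..V6).
df <- read.table(text = "
10 34 35 39 55 43
10 32 33 40 45 48
10 35 36 38 41 43
30 31 32 34 36 49
39 55 40 43 45 50
30 32 35 36 49 50
 2  8  9 39 55 43
 1  2  8 12 55 43
 2  8 12 55 43 61
 2  8 55 43 61 78
", header = FALSE)

# All order-preserving subsequences of length >= 3 in one row,
# returned as a flat character vector.
subseqs <- function(v) {
  unlist(lapply(3:length(v), function(k) combn(v, k, FUN = toString)))
}

# Count every subsequence across all rows.
f <- table(as.vector(apply(df, 1, subseqs)), dnn = "sequence")

# Convert to a data frame, keep frequency >= 2, then sort by
# descending frequency (ties broken by the sequence string).
counts <- as.data.frame(f, stringsAsFactors = FALSE)
dfout <- counts[counts$Freq >= 2, ]
dfout <- dfout[order(-dfout$Freq, dfout$sequence), ]
rownames(dfout) <- NULL

print(head(dfout, 5))
```

Filtering on the `Freq` column after the conversion keeps the step independent of how a subsetted table is coerced, which is a design preference rather than a correction.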