
R: Count all combinations in a list of strings (Specific Order)


I am trying to count all sequences in a large list of character strings delimited by ">", but only the combinations whose elements are directly next to each other.

e.g. given the character vector:

[1]Social>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>OrganicSearch>OrganicSearch>OrganicSearch
[2]Referral>Referral>Referral

I can run the following lines to retrieve all combinations of 2 adjacent elements:

split_fn <- sapply(p , strsplit , split = ">", perl=TRUE)

split_fn <- sapply(split_fn, function(x) paste(head(x,-1) , tail(x,-1) , sep = ">") )

Returns:

[[1]]

 [1] "Social>PaidSearch"           "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"      
 [6] "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"      
[11] "PaidSearch>OrganicSearch"    "OrganicSearch>OrganicSearch" "OrganicSearch>OrganicSearch"

[[2]]

[1] "Referral>Referral" "Referral>Referral"

These are all possible 2-element sequences in my data (split in order).

I now want to have all possible sequences of 3 elements.

e.g.

"Social>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch"..."Referral>Referral>Referral"

I tried to use

unlist(lapply(strsplit(p, split = ">"), function(i) combn(sort(i), 3, paste, collapse='>')))

But it returns all combinations, including those whose elements do not directly follow each other.

I also don't want it to return combinations of the last value in row one with the first value in row 2 etc.


Solution

  • Let's start with creating some data:

    set.seed(1)
    
    data <- lapply(1:3, function(i) sample(LETTERS[1:3], rpois(1, 6), replace = TRUE))
    data <- sapply(data, paste, collapse = ">")
    
    data
    #> [1] "B>B>C>A"           "C>B>B>A>A>A>C>B>C" "C>C>B>C>C>A"
    

    Given the problem, it makes sense to think of these data as a list of vectors that we get after splitting the elements by the delimiter ">":

    strsplit(data, ">")
    #> [[1]]
    #> [1] "B" "B" "C" "A"
    #> 
    #> [[2]]
    #> [1] "C" "B" "B" "A" "A" "A" "C" "B" "C"
    #> 
    #> [[3]]
    #> [1] "C" "C" "B" "C" "C" "A"
    

    Now, the core of the problem is to find all consecutive sequences of a given length from a single vector. Once we can do that, it's simple to apply over the list of data that we have; transforming back to the delimited format will also be simple.

    With that goal in mind, we can then make a function for extracting the sequences; here we just loop over each element and extract all sequences of the given length to a list:

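    # Return every run of `n` consecutive elements of `x` as a list of vectors.
    seqs <- function(x, n = 2) {
      # A vector shorter than `n` contains no such run
      if (length(x) < n)
        return(NULL)
      k <- n - 1
      # One window per valid start position i, covering x[i], ..., x[i + k]
      lapply(seq_len(length(x) - k), function(i) x[i:(i + k)])
    }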
    

    We can now apply the function across the data after splitting the delimited strings into vectors. We also need an additional sapply with paste to transform the result back into the delimited format that we started with:

    lapply(strsplit(data, ">"), function(x) {
      sapply(seqs(x, 3), paste, collapse = ">")
    })
    #> [[1]]
    #> [1] "B>B>C" "B>C>A"
    #> 
    #> [[2]]
    #> [1] "C>B>B" "B>B>A" "B>A>A" "A>A>A" "A>A>C" "A>C>B" "C>B>C"
    #> 
    #> [[3]]
    #> [1] "C>C>B" "C>B>C" "B>C>C" "C>C>A"
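
    For comparison, the head()/tail() pairing from the question can also be stretched to longer sequences by pasting several shifted copies of the vector together. The following is only a sketch of that idea; the helper name adjacent and its arguments n and sep are made up for illustration and are not part of the original answer.

    ```r
    # Hypothetical helper (not from the original answer): builds all runs of n
    # adjacent elements by pasting n shifted copies of x together element-wise.
    adjacent <- function(x, n = 2, sep = ">") {
      m <- length(x) - n + 1          # number of runs of length n
      if (m < 1) return(character(0)) # x is too short for any run
      # For shift k, take elements (1 + k) .. (m + k) of x
      shifted <- lapply(0:(n - 1), function(k) x[(1 + k):(m + k)])
      do.call(paste, c(shifted, sep = sep))
    }

    x <- c("A", "B", "B", "C")
    adjacent(x, 2)
    #> [1] "A>B" "B>B" "B>C"
    adjacent(x, 3)
    #> [1] "A>B>B" "B>B>C"
    ```

    With n = 2 this reduces to paste(head(x, -1), tail(x, -1), sep = ">"). Because each row is processed on its own, the last value of one row is never combined with the first value of the next.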
    

    Further, to get sequences of multiple lengths at the same time, we can add another layer of iteration:

    lapply(strsplit(data, ">"), function(x) {
      unlist(sapply(c(2, 3), function(n) {
        sapply(seqs(x, n), paste, collapse = ">")
      }))
    })
    #> [[1]]
    #> [1] "B>B"   "B>C"   "C>A"   "B>B>C" "B>C>A"
    #> 
    #> [[2]]
    #>  [1] "C>B"   "B>B"   "B>A"   "A>A"   "A>A"   "A>C"   "C>B"   "B>C"  
    #>  [9] "C>B>B" "B>B>A" "B>A>A" "A>A>A" "A>A>C" "A>C>B" "C>B>C"
    #> 
    #> [[3]]
    #> [1] "C>C"   "C>B"   "B>C"   "C>C"   "C>A"   "C>C>B" "C>B>C" "B>C>C" "C>C>A"
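
    Since the question asks for counts, the per-row sequences can be pooled and tabulated. This is a sketch that assumes "count" means the frequency of each sequence over all rows; the helper seqs_chr is a hypothetical, type-stable variant written for this example, not part of the original answer.

    ```r
    # Values copied from the `data` output above, so the sketch does not
    # depend on the random seed.
    data <- c("B>B>C>A", "C>B>B>A>A>A>C>B>C", "C>C>B>C>C>A")

    # Hypothetical helper: all runs of n adjacent elements of x, pasted back
    # together; returns character(0) when x is shorter than n.
    seqs_chr <- function(x, n = 2, sep = ">") {
      starts <- seq_len(max(length(x) - n + 1, 0))
      vapply(starts, function(i) paste(x[i:(i + n - 1)], collapse = sep),
             character(1))
    }

    # Each row is split and windowed separately, so rows never mix;
    # the results are then pooled and tabulated by frequency.
    triples <- unlist(lapply(strsplit(data, ">"), seqs_chr, n = 3))
    counts <- sort(table(triples), decreasing = TRUE)

    length(triples)
    #> [1] 13
    counts[["C>B>C"]]
    #> [1] 2
    ```

    Here "C>B>C" is the only sequence of length 3 that occurs more than once, so it sorts to the front of counts.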
    

    Created on 2018-05-21 by the reprex package (v0.2.0).