r, list, sequence, data-manipulation, data-cleaning

Break some sequences of data into several adjacent pieces


I have several sequences that I want to break into overlapping runs of adjacent numbers. The sequences are stored in a list, one per individual, and the size of the window that holds the adjacent numbers varies from one individual to another. Here are some example data:

#The sequences of three individuals
sequences <- list(c(1,2,3,5,6), c(2,3,4,5,6), c(1,3,4,6,7))

#The window size that contains the adjacent numbers
#For the first individual, every 2 adjacent numbers should be grouped together; for the second, every 3; for the third, every 4.
windowsize <- list(2,3,4)

#The breakdown of the adjacent numbers should look like:
[[1]]
[[1]][[1]]
[1] 1 2
[[1]][[2]]
[1] 2 3
[[1]][[3]]
[1] 3 5
[[1]][[4]]
[1] 5 6

[[2]]
[[2]][[1]]
[1] 2 3 4
[[2]][[2]]
[1] 3 4 5
[[2]][[3]]
[1] 4 5 6

[[3]]
[[3]][[1]]
[1] 1 3 4 6
[[3]][[2]]
[1] 3 4 6 7

My real dataset is much larger than this, so I am thinking that writing a function may be the way to achieve this. Thank you!


Solution

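  • We may use Map with embed from base R. Map loops over the corresponding elements of 'sequences' and 'windowsize'. embed creates a matrix with one row per window, with the number of columns given by the element (y) from 'windowsize'. Indexing with [, y:1] reverses the columns, because embed stores each window with its latest value first. Finally, asplit splits the matrix by row (MARGIN = 1).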

    Map(function(x, y) asplit(embed(x, y)[, y:1], 1), sequences, windowsize)
    

    -output

    [[1]]
    [[1]][[1]]
    [1] 1 2
    
    [[1]][[2]]
    [1] 2 3
    
    [[1]][[3]]
    [1] 3 5
    
    [[1]][[4]]
    [1] 5 6
    
    
    [[2]]
    [[2]][[1]]
    [1] 2 3 4
    
    [[2]][[2]]
    [1] 3 4 5
    
    [[2]][[3]]
    [1] 4 5 6
    
    
    [[3]]
    [[3]][[1]]
    [1] 1 3 4 6
    
    [[3]][[2]]
    [1] 3 4 6 7
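
    To see why the columns are reversed with [, y:1], it may help to look at what embed returns on its own. The following is a small check on the first example sequence; the names x and m are only for illustration.

    ```r
    x <- c(1, 2, 3, 5, 6)

    # embed() returns one row per window, with the latest value first
    m <- embed(x, 2)
    m
    #      [,1] [,2]
    # [1,]    2    1
    # [2,]    3    2
    # [3,]    5    3
    # [4,]    6    5

    # reversing the columns restores the original left-to-right order
    m[, 2:1]
    ```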
    

    If we want a matrix for each individual instead, just remove the asplit:

    Map(function(x, y) embed(x, y)[, y:1], sequences, windowsize)

    -output

    [[1]]
         [,1] [,2]
    [1,]    1    2
    [2,]    2    3
    [3,]    3    5
    [4,]    5    6
    
    [[2]]
         [,1] [,2] [,3]
    [1,]    2    3    4
    [2,]    3    4    5
    [3,]    4    5    6
    
    [[3]]
         [,1] [,2] [,3] [,4]
    [1,]    1    3    4    6
    [2,]    3    4    6    7
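
    Since the question mentions a larger dataset, the same idea could be wrapped in a function. This is only a sketch under a few assumptions: sliding_windows is a hypothetical name (it is not part of base R), a sequence shorter than its window is assumed to yield an empty list, and drop = FALSE is added because embed(x, y)[, y:1] drops to a plain vector when a sequence has exactly as many elements as its window, or when the window size is 1.

    ```r
    # sliding_windows() is a hypothetical helper name, not part of base R
    sliding_windows <- function(x, k) {
      # assumption: a sequence shorter than the window has no complete window
      if (length(x) < k) return(list())
      # drop = FALSE keeps a matrix even with a single row or a single column
      m <- embed(x, k)[, k:1, drop = FALSE]
      # return one list element per window (one per row of the matrix)
      lapply(seq_len(nrow(m)), function(i) m[i, ])
    }

    sequences  <- list(c(1, 2, 3, 5, 6), c(2, 3, 4, 5, 6), c(1, 3, 4, 6, 7))
    windowsize <- list(2, 3, 4)

    result <- Map(sliding_windows, sequences, windowsize)
    ```

    Here lapply over the rows is used in place of asplit so that each window comes back as a plain vector; the windows themselves are the same as in the output shown earlier.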