Search code examples
pythonsplitoverlap

A better way to split a sequence in chunks with overlaps?


I need a function to split an iterable into chunks with the option of having an overlap between the chunks.

I wrote the following code, which gives me the correct output but that is quite inefficient (slow). I can't figure out how to speed it up. Is there a better method?

def split_overlap(seq, size, overlap):
    '''(seq,int,int) => [[...],[...],...]
    Split a sequence into chunks of a specific size and overlap.
    Works also on strings! 

    Examples:
        >>> split_overlap(seq=list(range(10)),size=3,overlap=2)
        [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6], [5, 6, 7], [6, 7, 8], [7, 8, 9]]

        >>> split_overlap(seq=range(10),size=3,overlap=2)
        [range(0, 3), range(1, 4), range(2, 5), range(3, 6), range(4, 7), range(5, 8), range(6, 9), range(7, 10)]

        >>> split_overlap(seq=list(range(10)),size=7,overlap=2)
        [[0, 1, 2, 3, 4, 5, 6], [5, 6, 7, 8, 9]]
    '''
    if size < 1 or overlap < 0:
        raise ValueError('"size" must be an integer with >= 1 while "overlap" must be >= 0')
    result = []
    while True:
        if len(seq) <= size:
            result.append(seq)
            return result
        else:
            result.append(seq[:size])
            seq = seq[size-overlap:]

Testing results so far:

l = list(range(10))
s = 4
o = 2
print(split_overlap(l,s,o))
print(list(split_overlap_jdehesa(l,s,o)))
print(list(nwise_overlap(l,s,o)))
print(list(split_overlap_Moinuddin(l,s,o)))
print(list(gen_split_overlap(l,s,o)))
print(list(itr_split_overlap(l,s,o)))

[[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
[(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9)]
[(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9), (8, 9, None, None)] #wrong
[[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9], [8, 9]] #wrong
[[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
[(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9)]

%%timeit
split_overlap(l,7,2)
718 ns ± 2.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%%timeit
list(split_overlap_jdehesa(l,7,2))
4.02 µs ± 64.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
list(nwise_overlap(l,7,2))
5.05 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
list(split_overlap_Moinuddin(l,7,2))
3.89 µs ± 78.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
list(gen_split_overlap(l,7,2))
1.22 µs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%%timeit
list(itr_split_overlap(l,7,2))
3.41 µs ± 36.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

With longer list as input:

l = list(range(100000))

%%timeit
split_overlap(l,7,2)
4.27 s ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
list(split_overlap_jdehesa(l,7,2))
31.1 ms ± 495 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
list(nwise_overlap(l,7,2))
5.74 ms ± 66 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
list(split_overlap_Moinuddin(l,7,2))
16.9 ms ± 89.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
list(gen_split_overlap(l,7,2))
4.54 ms ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
list(itr_split_overlap(l,7,2))
19.1 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

From other tests (not reported here), it turned out that for small lists len(list) <= 100, my original implementation split_overlap() is the fastest. But for anything larger than that, gen_split_overlap() is the most efficient solution so far.


Solution

  • Sometimes readability counts vs. speed. A simple generator that iterates over indices, producing slices gets the job done in reasonable time:

    def gen_split_overlap(seq, size, overlap):        
        if size < 1 or overlap < 0:
            raise ValueError('size must be >= 1 and overlap >= 0')
    
        for i in range(0, len(seq) - overlap, size - overlap):            
            yield seq[i:i + size]
    

    If you want to handle potentially infinite iterables, you just have to keep overlap items from the previous yield and slice size - overlap new items:

    def itr_split_overlap(iterable, size, overlap):
        itr = iter(iterable)
    
        # initial slice, in case size exhausts iterable on the spot
        next_ = tuple(islice(itr, size))
        yield next_
        # overlap for initial iteration
        prev = next_[-overlap:] if overlap else ()
    
        # For long lists the repeated calls to a lambda are slow, but using
        # the 2-argument form of `iter()` is in general a nice trick.
        #for chunk in iter(lambda: tuple(islice(itr, size - overlap)), ()):
    
        while True:
            chunk = tuple(islice(itr, size - overlap))
    
            if not chunk:
                break
    
            next_ = (*prev, *chunk)
            yield next_
    
            # overlap == 0 is a special case
            if overlap:
                prev = next_[-overlap:]