Search code examples
pythonlisttuplesseparator

Split list by tuple separator


I have list:

print (L)
[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ'), 
 ('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]

I want split list to sublists with separator ('.', 'ZZ'):

print (new_L)
[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')], 
 [('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]

I am interested about another possible solutions, performance is important.


Solution

  • The for-loop approach will be faster, this requires only one-pass:

    >>> def juan(L, sep):
    ...     L2 = []
    ...     sub = []
    ...     for x in L:
    ...         sub.append(x)
    ...         if x == sep:
    ...             L2.append(sub)
    ...             sub = []
    ...     if sub:
    ...         L2.append(sub)
    ...     return L2
    ...
    >>> juan(L, sep)
    [[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')], [('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
    

    Some comparisons:

    >>> def jezrael(L, sub):
    ...     return [list(g) + [sep] for k, g in groupby(L, lambda x: x==sep) if not k]
    ...
    >>> def coldspeed(L, sep):
    ...     L2 = []
    ...     for i in reversed(L):
    ...         if i == sep:
    ...             L2.append([])
    ...         L2[-1].append(i)
    ...     return [x[::-1] for x in reversed(L2)]
    ...
    >>> def pm2ring(L, sep):
    ...     seplist = [sep]
    ...     return [list(g) + seplist for k, g in groupby(L, sep.__eq__) if not k]
    ...
    >>> setup = "from __main__ import L, sep, juan, coldspeed, pm2ring, jezrael"
    

    Edit: more timings

    >>> def buzzycoder(L, sep):
    ...     a = []
    ...     length = len(L)
    ...     start = 0
    ...     end = L.index(sep)
    ...     if start < length: a.append(L[start:end+1])
    ...     start = end + 1
    ...     while start < length:
    ...         end = L.index(sep, start) + 1
    ...         a.append(L[start:end])
    ...         start = end
    ...     return a
    ...
    
    >>> def splitList(l, s):
    ...     ''' l is list, s is separator, simular to split, but keep separator'''
    ...     i = 0
    ...     for _ in range(l.count(s)): # break using slices
    ...         e = l.index(s,i)
    ...         yield l[i:e+1] # sublist generator value
    ...         i = e+1
    ...     if e+1 < len(l): yield l[e+1:] # pick up
    ...
    
    >>> def bharath(x,sep):
    ...     n = [0] + [i+1 for i,j in enumerate(x) if j == sep]
    ...     m= list()
    ...     for first, last in zip(n, n[1:]):
    ...         m.append(x[first:last])
    ...     return m
    ...
    

    And the results:

    >>> timeit.timeit("jezrael(L, sep)", setup)
    4.1499102029483765
    >>> timeit.timeit("pm2ring(L, sep)", setup)
    3.3499899921007454
    >>> timeit.timeit("coldspeed(L, sep)", setup)
    2.868469718960114
    >>> timeit.timeit("juan(L, sep)", setup)
    1.5428746730322018
    >>> timeit.timeit("buzzycoder(L, sep)", setup)
    1.5942967369919643
    >>> timeit.timeit("list(splitList(L, sep))", setup)
    2.7872562300181016
    >>> timeit.timeit("bharath(L, sep)", setup)
    2.9842335029970855
    

    With a bigger list:

    >>> L = L*100000
    >>> timeit.timeit("jezrael(L, sep)", setup, number=10)
    3.3555950550362468
    >>> timeit.timeit("pm2ring(L, sep)", setup, number=10)
    2.337177241919562
    >>> timeit.timeit("coldspeed(L, sep)", setup, number=10)
    2.2037084710318595
    >>> timeit.timeit("juan(L, sep)", setup, number=10)
    1.3625159269431606
    >>> timeit.timeit("buzzycoder(L, sep)", setup, number=10)
    1.4375156159512699
    >>> timeit.timeit("list(splitList(L, sep))", setup, number=10)
    1.6824725979240611
    >>> timeit.timeit("bharath(L, sep)", setup, number=10)
    1.5603888860205188
    

    Caveat

    The results do not address performance given the proportion of sep in L, which will affect timings a lot for some of these solutions.