Search code examples
pythoniterationlist-comprehensionlazy-evaluation

python: efficient and pythonic way to get sublist from a descending list which is larger than threshold


Given a list with descending order, e.g. [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 0, 0, -1, -2, -2] and threshold = 1.2, I want to get sublist from original list with all elements larger than threshold

Method1:

orgin_lst = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 0, 0, -1, -2, -2]
lst = [i for i in orgin_lst if i > threshold]

This is pythonic way but we don't use the descending property and cannot break out when found a element not larger than threshold. If there are few satisfied elements but oringal list is very large, the performance is not good.

Method2:

orgin_lst = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 0, 0, -1, -2, -2]
lst = []
for i in orgin_lst:
    if i <= threshold:
        break
    lst.append(i)

However this code is not quite pythonic.

Is there a way that I can combine pythonic style and performance?


Solution

  • Python 3.10+

    Binary search is fast for sorted data, O(log n) time. And Python's bisect module already does it. It wants increasing data and yours is decreasing, but we can virtually make it increasing. Just use its shiny new key parameter to negate the O(log n) accessed elements (and search for the negated threshold):

    from bisect import bisect_left
    from operator import neg
    
    i = bisect_left(orgin_lst, -threshold, key=neg)
    lst = orgin_lst[:i]
    

    Alternatively, use a key function that returns False for values larger than the threshold and True otherwise. Since False is smaller than True (they act like 0 and 1, respectively), we again have a monotonically increasing sequence and can use bisect with this:

    from bisect import bisect
    
    i = bisect(orgin_lst, False, key=lambda x: x <= threshold)
    lst = orgin_lst[:i]
    

    If you don't need a separate new list, you could use del orgin_lst[i:] to instead remove the unwanted elements.

    Before Python 3.10

    Previously I would've written a proxy class to do the job now done by that much more convenient key parameter:

    from bisect import bisect_left
    
    class Negate:
        def __getitem__(_, i):
            return -orgin_lst[i]
    
    i = bisect_left(Negate(), -threshold, 0, len(orgin_lst))
    lst = orgin_lst[:i]
    

    Or I might've written binary search myself, but I've done that so many times that at some point I started to loathe it.

    Exponential search

    Under your Method1, the list comprehension comparing every element, you wrote: "If there are few satisfied elements but oringal list is very large, the performance is not good". If that was not just an argument against that list comprehension but you actually do have mostly very few satisfied elements and a very long list, then exponential search could be better than binary search. But it would be more code (unless you find a package for it, I guess).

    A simple iterative search like your Method2 (which I btw do find pythonic) or Chris' answer or with itertools.takewhile would also be fast in such extreme cases, but for cases with large numbers of satisfied elements, they'd be much slower than binary search and exponential search.

    itertools.takewhile

    Like I said it would be slower in general, but it's fast for those best-cases and it's quite simple and clean:

    from itertools import takewhile
    
    lst = list(takewhile(lambda x: x > threshold, orgin_lst))
    

    Faster loop

    Like I said I do find your loop pythonic, and it's good for best-cases. But calling append to individually append elements to the result is quite costly. Would probably be faster to at first just find the first too small element, then find its index and slice:

    for i in orgin_lst:
        if i <= threshold:
            lst = orgin_lst[:orgin_lst.index(i)]
            break
    else:
        lst = orgin_lst[:]
    

    Again, if you're ok with just removing the unwanted elements from the existing list, use del inside the if and then you don't need the else part here.

    A similar solution I wrote for another question ended up second-fastest in the benchmark there.

    Alternative implementation:

    cut = None
    for i in orgin_lst:
        if i <= threshold:
            cut = orgin_lst.index(i)
            break
    lst = orgin_lst[:cut]