Search code examples
pythoniteratorgeneratoriterable

Iterables / generators with specified length


Iterable objects are those that implement __iter__ function, which returns an iterator object, i.e. and object providing the functions __iter__ and __next__ and behaving correctly. Usually the size of the iterable object is not known beforehand, and iterable object is not expected to know how long the iteration will last; however, there are some cases in which knowing the length of the iterable is valuable, for example, when creating an array. list(x for x in range(1000000)), for example, creates an initial array of small size, copies it after it is full, and repeats for many times as explained here. Of course, it is not that important in this example, but it explains the point.

Is there a protocol in use for those iterable objects who know their length beforehand? That is, is there a protocol extending Sized and Iterable but not Collection or Reversible? It seems like there is no such protocol in language features, is there such a protocol for well-known third-party libraries? How this discussion relates to generators?


Solution

  • It sounds like you're asking about something like __length_hint__. Excerpts from PEP 424 – A method for exposing a length hint:

    CPython currently defines a __length_hint__ method on several types, such as various iterators. This method is then used by various other functions (such as list) to presize lists based on the estimate returned by __length_hint__. Types which are not sized, and thus should not define __len__, can then define __length_hint__, to allow estimating or computing a size (such as many iterators).

    Being able to pre-allocate lists based on the expected size, as estimated by __length_hint__, can be a significant optimization. CPython has been observed to run some code faster than PyPy, purely because of this optimization being present.

    For example, range iterators support this (Try it online!):

    it = iter(range(1000))
    print(it.__length_hint__())     # prints 1000
    next(it)
    print(it.__length_hint__())     # prints 999
    

    And list iterators even take list length changes into account (Try it online!):

    a = [None] * 10
    it = iter(a)
    print(it.__length_hint__())     # prints 10
    next(it)
    print(it.__length_hint__())     # prints 9
    a.pop()
    print(it.__length_hint__())     # prints 8
    a.append(None)
    print(it.__length_hint__())     # prints 9
    

    Generator iterators don't support it, but you can support it in other iterators you write. Here's a demo iterator that...

    • Produces 10,000 elements.
    • Hints at having 5,000 elements.
    • After every 1,000 elements it shows the memory size of the list being built.
    import gc
    
    beacon = object()
    
    class MyIterator:
        def __init__(self):
            self.n = 10_000
        def __iter__(self):
            return self
        def __length_hint__(self):
            print('__length_hint__ called')
            return 5_000
        def __next__(self):
            if self.n == 0:
                raise StopIteration
            self.n -= 1
            if self.n % 1_000 == 0:
                for obj in gc.get_objects():
                    if isinstance(obj, list) and obj and obj[0] is beacon:
                        print(obj.__sizeof__())
            return beacon
    
    list(MyIterator())
    

    Output (Try it online!):

    __length_hint__ called
    45088
    45088
    45088
    45088
    45088
    50776
    57168
    64360
    72456
    81560
    

    We see that list asks for a length hint and from the start pre-allocates enough memory for 5,000 references of 8 bytes each, plus 12.5% overallocation. After the first 5,000 elements, it doesn't ask for length hints anymore, and keeps increasing its size bit by bit.

    If my __length_hint__ instead accurately returns 10,000, then list instead pre-allocates 90088 bytes and that remains until the end.