Iterable objects are those that implement __iter__
function, which returns an iterator object, i.e. and object providing the functions __iter__
and __next__
and behaving correctly. Usually the size of the iterable object is not known beforehand, and iterable object is not expected to know how long the iteration will last; however, there are some cases in which knowing the length of the iterable is valuable, for example, when creating an array. list(x for x in range(1000000))
, for example, creates an initial array of small size, copies it after it is full, and repeats for many times as explained here. Of course, it is not that important in this example, but it explains the point.
Is there a protocol in use for those iterable objects who know their length beforehand? That is, is there a protocol extending Sized and Iterable but not Collection or Reversible? It seems like there is no such protocol in language features, is there such a protocol for well-known third-party libraries? How this discussion relates to generators?
It sounds like you're asking about something like __length_hint__
. Excerpts from PEP 424 – A method for exposing a length hint:
CPython currently defines a
__length_hint__
method on several types, such as various iterators. This method is then used by various other functions (such aslist
) to presize lists based on the estimate returned by__length_hint__
. Types which are not sized, and thus should not define__len__
, can then define__length_hint__
, to allow estimating or computing a size (such as many iterators).
Being able to pre-allocate lists based on the expected size, as estimated by
__length_hint__
, can be a significant optimization. CPython has been observed to run some code faster than PyPy, purely because of this optimization being present.
For example, range
iterators support this (Try it online!):
it = iter(range(1000))
print(it.__length_hint__()) # prints 1000
next(it)
print(it.__length_hint__()) # prints 999
And list
iterators even take list length changes into account (Try it online!):
a = [None] * 10
it = iter(a)
print(it.__length_hint__()) # prints 10
next(it)
print(it.__length_hint__()) # prints 9
a.pop()
print(it.__length_hint__()) # prints 8
a.append(None)
print(it.__length_hint__()) # prints 9
Generator iterators don't support it, but you can support it in other iterators you write. Here's a demo iterator that...
import gc
beacon = object()
class MyIterator:
def __init__(self):
self.n = 10_000
def __iter__(self):
return self
def __length_hint__(self):
print('__length_hint__ called')
return 5_000
def __next__(self):
if self.n == 0:
raise StopIteration
self.n -= 1
if self.n % 1_000 == 0:
for obj in gc.get_objects():
if isinstance(obj, list) and obj and obj[0] is beacon:
print(obj.__sizeof__())
return beacon
list(MyIterator())
Output (Try it online!):
__length_hint__ called
45088
45088
45088
45088
45088
50776
57168
64360
72456
81560
We see that list
asks for a length hint and from the start pre-allocates enough memory for 5,000 references of 8 bytes each, plus 12.5% overallocation. After the first 5,000 elements, it doesn't ask for length hints anymore, and keeps increasing its size bit by bit.
If my __length_hint__
instead accurately returns 10,000, then list
instead pre-allocates 90088
bytes and that remains until the end.