Search code examples
rubyenumerable

Get slice of an Enumerator effectively


I am having troubles getting a slice of an infinite sequence of Enumerator instance in a reasonable time. I first tried drop and take chain, but it hanged forever as drop tried to return an infinite Array. Next, I switched the order of these methods, but I still have to wait about ten minutes to get 100 values after ten-milionth sample:

print exbioseq.drop(10**7).take(100)

Can anything be done to get a slice faster?


Solution

  • An Enumerator is very general interface, it makes only very simple assumptions about the "collection" it is traversing. In particular, it really only supports two operations: get the current element and iterate to the next element.

    Given those two operations, if you want to get the 10 millionth element, there is only one thing you can do: iterate 10 million times. Which takes time.

    There is no such thing as "slicing" an Enumerator. An Enumerator enumerates. That's it.

    Now, as you discovered, there is another problem: Ruby's collection operations are not type-preserving. No matter what type of collection you call map or select or take or whatever on, it will always return the same type: a fully realized, concrete, strict Array. That's how most collection frameworks in most languages work, e.g. in .NET all collection operations return IEnumerable. That's because most of these methods have only a single common implementation in the Enumerable mixin.

    Smalltalk is an exception, but there is another problem: the collection operations are duplicated for every single collection type. Every collection type has its own nearly-indetical practically copy&paste implementation of collect:, select: etc. This code duplication is hard to maintain and places a big burden on anyone who wants to integrate their own collection into the framework. In Ruby, it's easy: implement each, mixin Enumerable and you're done.

    Note: as of Ruby 1.9, there is actually some of that duplication: Hash implements its own version of select which does actually return a Hash and not an Array. So, now, not only is there code duplication but also an asymmetry in the interface: all implementations of select return Arrays except for the one in Hash.

    The Scala 2.8 collection framework is the first time ever that someone has figured out how to provide type-preserving collection operations without code duplication. But Ruby's collection framework was designed 15 years before Scala 2.8, so it cannot take advantage of that knowledge.

    In Ruby 2.0, there are lazy Enumerators, where all collection operations return another lazy Enumerator. But that won't help you here: the only difference is that the lazy Enumerator will delay the 10 million iterations until you actually print the values. It still has to perform those 10 million iterations because there is simply no way to do otherwise.

    If you want slicing, you need a sliceable data structure, such as an Array.