Search code examples
javajava-streamlazy-evaluation

Java streams and lazy collections


I have a method which returns a Collection of objects by performing some costly operation (e.g. using a database for resolving objects by ther ID).

Let's use this method as an example:

public static Collection<Integer> getInts()
{
    return IntStream.range(1, 100)
            .peek((value) -> System.out.printf("getInts(): iterating over [%d]\n", value))
            .boxed()
            .collect(Collectors.toList());
}

Then I have another method which checks if an object which matches a predicate exists in that collection:

public static boolean primeExists()
{
    return getInts().stream()
        .peek((value) -> System.out.printf("primeExists(): iterating over [%d]\n", value))
        .anyMatch(Main::isPrime); // checks if there is a prime number
}

By running this code, I have noticed that the peek in getInts() method prints 10 times, one for each number, even though the primeExists() actually needs only the first two elements.

Here is a full code example:

public class Main
{
    private static boolean isPrime(int n)
    {
        if (n < 2)
        {
            return false;
        }

        for (int i = 2; i * i <= n; i++)
        {
            if (n % i == 0)
            {
                return false;
            }
        }

        return true;
    }

    public static Collection<Integer> getInts()
    {
        return IntStream.range(1, 10)
                .peek((value) -> System.out.printf("getInts(): iterating over [%d]\n", value))
                .boxed()
                .collect(Collectors.toList());
    }

    public static boolean primeExists()
    {
        return getInts().stream()
                        .peek((value) -> System.out.printf("primeExists(): iterating over [%d]\n", value))
                        .anyMatch(Main::isPrime);
    }

    public static void main(String[] args) throws IOException
    {
        primeExists();
    }
}

This is the output:

getInts(): iterating over [1]
getInts(): iterating over [2]
getInts(): iterating over [3]
getInts(): iterating over [4]
getInts(): iterating over [5]
getInts(): iterating over [6]
getInts(): iterating over [7]
getInts(): iterating over [8]
getInts(): iterating over [9]
primeExists(): iterating over [1]
primeExists(): iterating over [2]

How can I change this code to get this output instead:

getInts(): iterating over [1]
getInts(): iterating over [2]
primeExists(): iterating over [1]
primeExists(): iterating over [2]

Is it possible to get this behaviour without changing the getInts() method to return an intermediate Stream?

I think that the ideal solution is to return a lazy collection from getInts() (e.g. a collection whose elements are created only when they are actually iterated). Is there a way to create such collections in Java?


Solution

  • NO.

    Okay, I have answered the question. But let's provide a whole bunch of context:

    I think that the ideal solution is to return a lazy collection from getInts() (e.g. a collection whose elements are created only when they are actually iterated). Is there a way to create such collections in Java?

    You're confused about terminology. Often in officially defined systems (and a programming language is certainly far more specifically defined than your usual human-to-human conversations), terminology gains a far more specific, far more detailedly defined meaning.

    In your head, 'collection' could be lazy, why not, right? Collection is a word, in a dictionary, nothing in the dictionary dictates it MUST be eager.

    But in java, j.u.Collection IMPLIES eager. Asking for a lazy collection, 'collection' there referring to the english word, that makes sense. asking for 'a lazy Collection', with collection referring specifically to java.util.Collection, makes no sense. It's like asking for: "A circle but with a corner". It's an oxymoron. Given that you're in java land, using the word 'collection' and intending for that to be taken as the dictionary word and not j.u.Collection is a mistake: Words serve to communicate ideas, if the words don't do the job right, you've chosen bad words. It is, naturally, virtually impossible to attempt to disambiguate 'collection - in the sense of english dictionary' and 'collection - in the sense of j.u.Collection', so find another word to convey that notion.

    So, what is java's official term for 'a lazy collection-as-in-dictionary'? It is Stream.

    Said differently, you have a method whose return type is "EagerCollection" (Except in java we just call that Collection), and are asking: Is there any way to make that non-eager? Well, no. The spec is literally telling you it has to be eager.

    Thinking about the API of a hypothetical 'lazy collection', and going by the API of j.u.Collection for inspiration, it just wouldn't make sense. Lots of API needs to either:

    • Be removed entirely for not making sense / having nasty side-effects we would never want to expose.
    • Document that it 'consumes' the collection, with no way to return, and/or have ridiculous side-effects such as taking an hour and allocating 2 GB worth of RAM.
    • Just isn't a thing you generally would be interested in.
    • Needs to throw something.
    • Needs to return magic values that have weird meanings, such as size() being specced to return -1 for infinite collections and -2 for 'I have no idea, I might even be infinite'.
    • Handcuffs the notion into eliminating the ability of such a concept to apply to use cases even though they feel right (such as adding the handcuff that lazy collections are nevertheless required to accurately report size() in a quick and non-destructive manner, which means you can't make a lazy collection to represent data being piped in across the network).

    This is why Stream exists: Different API was required. This is why java.util.Collection (and particularly, java.util.List) inherently means 'eager'. Even though a few things (really, just iterator(), almost every other method is non-sensical or not implementable in a fast and non-destructive manner) could also apply to 'lazy collections'.

    As a trivial example, .get(1000) often involves either destruction (I could give you the 1000th element, but in doing so, elements 0-999 are consumed and can never be returned again, so if you call .get(500) afterwards, an exception will have to occur), or caching (I will allocate some memory and pull 1000 elements, caching 0-999, so that .get(500) later does actually work, though it grabs from cache).

    Instead in streams, you'd write .skip(1000).findFirst() which is about as simple and which is far more accurate in what it actually does - whereas with .get(1000) I have no idea if that is going to create a cache or crash or how the heck is that going to work, I'd have to read the docs, with .skip(1000).findFirst() it's obvious from just that little bit of text: That will destroy elements 0-999 (pull them from the source and discard them, I can never return to these).