Search code examples
kotlinsequenceexecution-time

Distinct sequence takes less time to iterate compared to normal sequence


I've been reading about Kotlin sequences and saw that the distinct method is a stateful intermediate operation. I'm guessing it's stateful because obviously it needs to keep track of what elements it just saw to be able to remove duplicates; and also some code somewhere actually doing the duplicate check right ?

When I checked the time my code takes while not using Sequence.distinct, it took 50920 ms to execute. But with the distinct operation it took even less time, 32548 ms. This doesn't make sense to me since there should be extra code to check for duplicates.. What's the tradeoff here and what did I miss?

Here's the code i used to test, i just compose an alphabet sequence multiple times then call distinct:

val alphabet = sequenceOf( "", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9")

fun iterate(digits: Int): Sequence<String> {
    var sequence = alphabet.asSequence()
    for (i in 1 until digits)
        sequence = iterate(sequence)
    return sequence.distinct()    // <--- this is where i add or remove the distinct
}

fun iterate(sequence: Sequence<String>) = sequence {
    sequence.forEach { c1 ->
        alphabet.forEach { c2 ->
            yield("$c1$c2")
        }
    }
}

While testing i've been doing :

iterate(7).forEach(::println)

So is it possible that the println takes more time than the duplicate check since I/O is always heavy ? I'm completely clueless and dumb-founded so any explanation would be greatly appreciated.

EDIT:

Based on cactustictacs answer, I re-ran the code while generating 7 digit words so it's 11^7 == 19'487'171 total words being generated (I tried 8 digit words but I keep getting an OutOfMemoryError every time i used distinct, probably due to it keeping track of the elements it found so i can't really compare with more).

Here's the results (in ms) :

Run # With distinct Without distinct
1 34015 47687
2 32454 48232
3 32842 47384
4 31037 47084
5 32224 48030
6 31660 47129
7 31678 48407
8 32123 51186
9 35136 48671
10 32245 47698

Solution

  • So it's the same code, just with an extra distinct() operation at the end? Yeah that should make it take longer, since you're doing extra work that doesn't make it any more efficient.

    But the actual execution time depends on system resources while the loop is running - if the system is busy, you'll see longer times. If you want to benchmark this stuff, you need to run it multiple times, and pick the most consistent numbers - maybe a median result (so you're ignoring any outliers)

    You might need a much larger final sequence to see any notable consistent difference between running distinct or not though!


    edit - I missed that you're printing every item in the sequence with some very large numbers, in that case yeah you're right, that's probably gonna be the bottleneck! Calling distinct() is going to cut down that work significantly here, because the way you're generating the sequence produces a lot of duplicates, especially as your iteration count gets larger:

    fun main() {
        for (i in 1..5) {
            val results = iterate(i)
            val total = results.count()
            val distinct = results.distinct().count()
            println("$i) total: $total - difference: ${total - distinct}")
        }
    }
    
    1) total: 11 - difference: 0
    2) total: 121 - difference: 10
    3) total: 1331 - difference: 220
    4) total: 14641 - difference: 3530
    5) total: 161051 - difference: 49940
    

    By 5 iterations you're stripping out almost a third of the results! That kind of work is generally fast, printing to output isn't so much. Try benchmarking each stage (the processing and the printing) and see what their averages are - because it's a sequence, you'll need to consume it to just do the "processing" step alone, running count() is a lightweight way to do that. Then you can compare that to the "processing + printing" version and see how much overhead the print adds


    FYI (in case you're not doing this as a stress test) you can avoid the distinct call by just yielding each digit as part of each iteration:

    // you don't really gain anything by having this as a sequence (it's slower too),
    // but I'm keeping it the same for comparison's sake
    val alphabet = (0..9).map(Int::toString).asSequence()
    
    fun iterate(sequence: Sequence<String>) = sequence {
        // combo each item with each element in the sequence (if any)
        // this produces items at least two-digits long
        sequence.forEach { c1 ->
            alphabet.forEach { c2 ->
                yield("$c1$c2")
            }
        }
        // also yield all the single digits
        alphabet.forEach( { yield(it) })
    }
    

    That way the 1st iteration just produces single-digit items. The next extends each of those to create all the two-digit combos, and adds the single-digit items in too. The 3rd iteration extends each of those, creating all the 3-digit and 2-digit combinations, and adds the single-digit ones back in...

    You're never producing duplicates, because each round is effectively just adding all the possible extensions of the current sequence. Using a blank character like you're doing complicates that, because each round adds extended versions of each item, but also non-extended duplicates of everything. It's easier to just turn all the x-length items into all the possible x+1-length items, then throw all the single-digit items back in (since the ones in the previous sequence got extended to two digits)

    I hope that's not too confusing, I feel like it's an overcomplicated explanation. Also this way doesn't yield an empty string as part of the sequence (the combo of "" + "" + ""...) - if you need that you could do return sequence.plus("") or something

    I get a 2-3x speed increase with this for 5 iterations - it's not just the distinct call, you're avoiding doing duplicate work in the first place, and with more iterations that inefficiency can really add up! Especially for sequences, which are memory-efficient and let you stop earlier, but run a lot slower than iterables