I've been reading about Kotlin sequences and saw that the distinct
method is a stateful intermediate
operation. I'm guessing it's stateful because obviously it needs to keep track of what elements it just saw to be able to remove duplicates; and also some code somewhere actually doing the duplicate check right ?
When I checked the time my code takes while not using Sequence.distinct
, it took 50920 ms
to execute. But with the distinct
operation it took even less time, 32548 ms
. This doesn't make sense to me since there should be extra code to check for duplicates.. What's the tradeoff here and what did I miss?
Here's the code i used to test, i just compose an alphabet sequence multiple times then call distinct
:
val alphabet = sequenceOf( "", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9")
fun iterate(digits: Int): Sequence<String> {
var sequence = alphabet.asSequence()
for (i in 1 until digits)
sequence = iterate(sequence)
return sequence.distinct() // <--- this is where i add or remove the distinct
}
fun iterate(sequence: Sequence<String>) = sequence {
sequence.forEach { c1 ->
alphabet.forEach { c2 ->
yield("$c1$c2")
}
}
}
While testing i've been doing :
iterate(7).forEach(::println)
So is it possible that the println
takes more time than the duplicate check since I/O
is always heavy ? I'm completely clueless and dumb-founded so any explanation would be greatly appreciated.
EDIT:
Based on cactustictacs
answer, I re-ran the code while generating 7 digit words so it's 11^7 == 19'487'171
total words being generated (I tried 8 digit words but I keep getting an OutOfMemoryError
every time i used distinct
, probably due to it keeping track of the elements it found so i can't really compare with more).
Here's the results (in ms) :
Run # | With distinct | Without distinct |
---|---|---|
1 | 34015 | 47687 |
2 | 32454 | 48232 |
3 | 32842 | 47384 |
4 | 31037 | 47084 |
5 | 32224 | 48030 |
6 | 31660 | 47129 |
7 | 31678 | 48407 |
8 | 32123 | 51186 |
9 | 35136 | 48671 |
10 | 32245 | 47698 |
So it's the same code, just with an extra distinct()
operation at the end? Yeah that should make it take longer, since you're doing extra work that doesn't make it any more efficient.
But the actual execution time depends on system resources while the loop is running - if the system is busy, you'll see longer times. If you want to benchmark this stuff, you need to run it multiple times, and pick the most consistent numbers - maybe a median result (so you're ignoring any outliers)
You might need a much larger final sequence to see any notable consistent difference between running distinct
or not though!
edit - I missed that you're printing every item in the sequence with some very large numbers, in that case yeah you're right, that's probably gonna be the bottleneck! Calling distinct()
is going to cut down that work significantly here, because the way you're generating the sequence produces a lot of duplicates, especially as your iteration count gets larger:
fun main() {
for (i in 1..5) {
val results = iterate(i)
val total = results.count()
val distinct = results.distinct().count()
println("$i) total: $total - difference: ${total - distinct}")
}
}
1) total: 11 - difference: 0
2) total: 121 - difference: 10
3) total: 1331 - difference: 220
4) total: 14641 - difference: 3530
5) total: 161051 - difference: 49940
By 5 iterations you're stripping out almost a third of the results! That kind of work is generally fast, printing to output isn't so much. Try benchmarking each stage (the processing and the printing) and see what their averages are - because it's a sequence, you'll need to consume it to just do the "processing" step alone, running count()
is a lightweight way to do that. Then you can compare that to the "processing + printing" version and see how much overhead the print adds
FYI (in case you're not doing this as a stress test) you can avoid the distinct
call by just yielding each digit as part of each iteration:
// you don't really gain anything by having this as a sequence (it's slower too),
// but I'm keeping it the same for comparison's sake
val alphabet = (0..9).map(Int::toString).asSequence()
fun iterate(sequence: Sequence<String>) = sequence {
// combo each item with each element in the sequence (if any)
// this produces items at least two-digits long
sequence.forEach { c1 ->
alphabet.forEach { c2 ->
yield("$c1$c2")
}
}
// also yield all the single digits
alphabet.forEach( { yield(it) })
}
That way the 1st iteration just produces single-digit items. The next extends each of those to create all the two-digit combos, and adds the single-digit items in too. The 3rd iteration extends each of those, creating all the 3-digit and 2-digit combinations, and adds the single-digit ones back in...
You're never producing duplicates, because each round is effectively just adding all the possible extensions of the current sequence. Using a blank character like you're doing complicates that, because each round adds extended versions of each item, but also non-extended duplicates of everything. It's easier to just turn all the x-length items into all the possible x+1-length items, then throw all the single-digit items back in (since the ones in the previous sequence got extended to two digits)
I hope that's not too confusing, I feel like it's an overcomplicated explanation. Also this way doesn't yield an empty string as part of the sequence (the combo of "" + "" + ""...
) - if you need that you could do return sequence.plus("")
or something
I get a 2-3x speed increase with this for 5 iterations - it's not just the distinct
call, you're avoiding doing duplicate work in the first place, and with more iterations that inefficiency can really add up! Especially for sequence
s, which are memory-efficient and let you stop earlier, but run a lot slower than iterable
s