I came across that iterator() of HashSet takes too much time. Here is some code example which proves the observation:
import java.text.SimpleDateFormat;
import java.util.*;
import java.util.stream.IntStream;
public class Main {
public static final int LENGTH = 100000;
public static void calculate() {
Set<Integer> set = new HashSet<>();
int part1 = 0;
int part2 = 0;
for (int i = 0; i < LENGTH; i++) {
set.add(i);
}
for (int i = 0; i < LENGTH; i++) {
long cur1 = System.currentTimeMillis();
Iterator<Integer> it = set.iterator();
long cur2 = System.currentTimeMillis();
part1 += cur2 - cur1;
int elem = it.next();
set.remove(elem);
long cur3 = System.currentTimeMillis();
part2 += cur3 - cur2;
}
System.out.println("Total milliseconds to execute iterator() code: " + part1);
System.out.println("Total milliseconds to execute remove() code: " + part2);
}
public static void main(String[] args) {
System.out.println(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(new Date()));
calculate();
System.out.println(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(new Date()));
}
}
The result of the run is:
2023-08-28 14:05:51.080
Total milliseconds to execute iterator() code: 9477
Total milliseconds to execute remove() code: 18
2023-08-28 14:06:00.595
I checked the documentation and implementation of iterator() and only saw that it should be a constant time but it doesn't seem so.
Does anybody know why is it takes so much time?
iterator()
normally takes constant time -- but to be specific, it takes time linear in the position of the first element in the hash table to create the iterator.
This is O(1) on average, except that each time, you remove the first element in the hash table, meaning that iterator()
actually ends up taking linear time in how many remove
s you've done. Let me diagram this with a simple example.
Let's imagine you have a hash table with four elements in four hash buckets, with elements assigned to buckets in some nonsensical order:
bucket 0: 4
bucket 1: 2
bucket 2: 1
bucket 3: 3
Now you create iterator()
, and its first element is 4. You remove it. Now it's
bucket 0: none
bucket 1: 2
bucket 2: 1
bucket 3: 3
Do this two more times, and on the last run it's
bucket 0: none
bucket 1: none
bucket 2: none
bucket 3: 3
...so it has to iterate through the whole thing to find that first eleement.
In general, it is the case that iterator()
performs worse on basic HashSet
s after removing. From the Javadoc:
Iterating over this set requires time proportional to the sum of the HashSet instance's size (the number of elements) plus the "capacity" of the backing HashMap instance (the number of buckets).
Since you have created a HashSet
with capacity for LENGTH
elements, iterating the whole set in general takes O(LENGTH). You have simply created an environment that's the worst possible case for iterator()
.
If you had done, for example, remove(i)
, it would certainly perform better, though it'd still perform worse over time, and certainly take O(n) for the last remove.
If you find yourself saying, "this is silly, why doesn't
Java do something better" -- well, 99.9% of code doesn't do this, and for the fraction that does, LinkedHashSet
fixes the whole issue.