I've been fooling around with a bunch of different ways of searching collections, collections of collections, etc. Doing lots of stupid little tests to verify my understanding. Here is one which boggles me (source code further below).
In short, I am generating N random integers and adding them to a list. The list is NOT sorted. I then use Collections.contains()
to look for a value in the list. I intentionally look for a value that I know won't be there, because I want to ensure that the entire list space is probed. I time this search.
I then do another linear search manually, iterating through each element of the list and checking if it matches my target. I also time this search.
On average, the second search takes 33% longer than the first one. By my logic, the first search must also be linear, because the list is unsorted. The only possibility I could think of (which I immediately discard) is that Java is making a sorted copy of my list just for the search, but (1) I did not authorize that usage of memory space and (2) I would think that would result in MUCH more significant time savings with such a large N.
So if both searches are linear, they should both take the same amount of time. Somehow the Collections class has optimized this search, but I can't figure out how. So... what am I missing?
import java.util.*;
public class ListSearch {
public static void main(String[] args) {
int N = 10000000; // number of ints to add to the list
int high = 100; // upper limit for random int generation
List<Integer> ints;
int target = -1; // target will not be found, forces search of entire list space
long start;
long end;
ints = new ArrayList<Integer>();
start = System.currentTimeMillis();
System.out.print("Generating new list... ");
for (int i = 0; i < N; i++) {
ints.add(((int) (Math.random() * high)) + 1);
}
end = System.currentTimeMillis();
System.out.println("took " + (end-start) + "ms.");
start = System.currentTimeMillis();
System.out.print("Searching list for target (method 1)... ");
if (ints.contains(target)) {
// nothing
}
end = System.currentTimeMillis();
System.out.println(" Took " + (end-start) + "ms.");
System.out.println();
ints = new ArrayList<Integer>();
start = System.currentTimeMillis();
System.out.print("Generating new list... ");
for (int i = 0; i < N; i++) {
ints.add(((int) (Math.random() * high)) + 1);
}
end = System.currentTimeMillis();
System.out.println("took " + (end-start) + "ms.");
start = System.currentTimeMillis();
System.out.print("Searching list for target (method 2)... ");
for (Integer i : ints) {
// nothing
}
end = System.currentTimeMillis();
System.out.println(" Took " + (end-start) + "ms.");
}
}
EDIT: Below is a new version of this code. The interesting thing is that now my manual linear loop performs 16% faster than the contains
method (NOTE: both are designed to intentionally search the entire list space, so I know they are equal in number of iterations). I can't account for this 16% gain... more confusion.
import java.util.*;
public class ListSearch {
public static void main(String[] args) {
int N = 10000000; // number of ints to add to the list
int high = 100; // upper limit for random int generation
List<Integer> ints;
int target = -1; // target will not be found, forces search of entire list space
long start;
long end;
ints = new ArrayList<Integer>();
start = System.currentTimeMillis();
System.out.print("Generating new list... ");
for (int i = 0; i < N; i++) {
ints.add(((int) (Math.random() * high)) + 1);
}
end = System.currentTimeMillis();
System.out.println("took " + (end-start) + "ms.");
start = System.currentTimeMillis();
System.out.print("Searching list for target (method 1)... ");
if (ints.contains(target)) {
System.out.println("hit");
}
end = System.currentTimeMillis();
System.out.println(" Took " + (end-start) + "ms.");
System.out.println();
ints = new ArrayList<Integer>();
start = System.currentTimeMillis();
System.out.print("Generating new list... ");
for (int i = 0; i < N; i++) {
ints.add(((int) (Math.random() * high)) + 1);
}
end = System.currentTimeMillis();
System.out.println("took " + (end-start) + "ms.");
start = System.currentTimeMillis();
System.out.print("Searching list for target (method 2)... ");
for (int i = 0; i < N; i++) {
if (ints.get(i) == target) {
System.out.println("hit");
}
}
end = System.currentTimeMillis();
System.out.println(" Took " + (end-start) + "ms.");
}
}
Your comparison code is buggy, and this is distorting your results.
This does search for the target
if (ints.contains(target)) {
// nothing
}
But this does not!
for (Integer i : ints) {
// nothing
}
You are actually just iterating over the list elements without testing them.
Having said that, the second version is slower than the first version for one or more of the following reasons:
The first version will be iterating the backing array using a simple for
loop and an index. The second version is equivalent to the following:
Iterator<Integer> it = ints.iterator();
while (it.hasNext()) {
Integer i = (Integer) it.next();
}
In other words, each time around the loop involves 2 method calls and a typecast1.
The first version will return true
as soon as it gets a match. Because of the bug in your implementation, the second version will iterate the entire list every time. In fact, given the choice of N
and high
, this effect is the most likely the main cause of the difference in performance.
1 - Actually, it is not entirely clear what the JIT compiler will do with all of this. It could in theory inline the method calls, deduce that the typcaset is unnecessary, or even optimize away the entire loop. On the other hand, there are factors that may inhibit these optimizations. For instance, ints
is declared as List<Integer>
which could inhibit inlining ... unless the JIT is able to deduce that the actual type is always the same.
Your results are possibly also distorted for another reason. Your code does not take account of JVM warmup. Read this Question for more details: How do I write a correct micro-benchmark in Java?