I am comparing two lists of objects using streams (filter / anyMatch). Each list can contain up to a million objects.
I ran a test with the code below; the two lists are often close in size. How can I improve its response time?
List<String> listDifferences = list1.stream()
        .filter(o1 -> list2.stream()
                .noneMatch(o2 -> o2.getValue().equals(o1.getValue())))
        .map(ObjectValue::getValue)
        .collect(Collectors.toList());
Response time for 699,839 objects in each list:
Fri May 10 16:14:09 CEST 2024
processing ....
Fri May 10 16:33:30 CEST 2024
The fact that you focused the question on 'streams' / 'filter' suggests you think that futzing about with how you iterate through collections somehow has a measurable effect on performance. Or, worse, that you think streams are 'just better', specifically 'faster'.
That's all wrong.
It's not about how you iterate through this stuff. Every way of iterating is equally fast (or, in this case, equally slow), and streams are, if anything, slower. Streams are about readability, and they are often less readable. They are a tool in the toolbox: use them when appropriate, and often they aren't appropriate. If your thought process is "streams... those are... better... because... they are streams!", that is highly detrimental. Good code is good because it is readable/understandable (meaning a reader who isn't intimately familiar with it will draw a conclusion about what it does and how, more quickly than with alternative ways of writing it, and that conclusion is correct), easily tested, and flexible (in the face of expectable change requests, it is easily adapted to fulfil them).
If writing it with streams results in better 'scores' on these factors, then streams are better. If not, they aren't. It is never really about 'this style is better than that style', for any style. Certainly not 'use streams'.
What you're actually looking for has nothing to do with streams. It has to do with how the data structures are designed. You are looking for the concept of algorithmic complexity, generally stated in terms of 'big-O notation' - you may have heard that term.
Given an algorithm that operates on some unknown but countable number of entries, the time it takes (or the memory it consumes) forms a mathematical relationship with how many entries it processes. Big-O notation expresses that relationship. It simply states: assuming a big enough input set, how much longer do things take, in relative terms, when you grow or shrink that input set? It makes no claims about CPU-specific performance; in fact, because of how computers work, for a given algorithm that relationship holds for any architecture and any OS.
For example, your algorithm is O(n*m), where n stands for the size of input list 1 and m for the size of input list 2. To simplify matters we can assume both lists are about equal in size, making that O(n^2).
In other words, if you find that at n = 10,000 it takes 10 seconds, then for n = 20,000 the expectation is that it'll take about 4 times as long, roughly 40 seconds. The time it takes grows quadratically while the number of elements you process grows linearly. That's not great.
It doesn't matter how you process the data (with for loops or with streams). It doesn't matter on what CPU you run this, or in what programming language you write it. The O(n^2) characteristic is a characteristic of this algorithm.
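For illustration, a plain nested-loop version of your stream pipeline (using the ObjectValue type from your question) reads differently but has exactly the same O(n*m) characteristic:
List<String> listDifferences = new ArrayList<>();
for (ObjectValue o1 : list1) {                       // n iterations
    boolean found = false;
    for (ObjectValue o2 : list2) {                   // up to m iterations per element of list1
        if (o2.getValue().equals(o1.getValue())) {
            found = true;
            break;
        }
    }
    if (!found) {
        listDifferences.add(o1.getValue());          // kept only if no match anywhere in list2
    }
}
Loop or stream, it's the same work: for every element of list1, potentially scan all of list2.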
The fix, then, is to use a different algorithm!
The set-based approach below is orders of magnitude faster, because it has a very different performance characteristic: O(n+m), which boils down to simply O(n) if we say that the lists are about equally large. Because algorithmic complexity is about relative measures, constant factors are irrelevant (thus, O(2n) is simply O(n)).
That's because first we add all elements of list 1 to a set, which takes O(n) time - doubling the input size doubles the time, in contrast to your algorithm, where doubling the input size quadruples the processing time. Then, for each element in the second list we ask the set whether it contains that element, which is O(1), so that whole step is O(m). Run them both in sequence and we get O(n+m), which simplifies to O(n):
var set1 = new HashSet<String>(list1); // O(n)
var set2 = new HashSet<String>(list2); // O(m)
set1.removeAll(set2); // O(n)
... and I guess, if you must have your answer in list form:
return new ArrayList<String>(set1); // O(n)
That's a bunch of O(n)s in sequence, so it's just O(n) in total.
Note that none of this uses streams. You can use them if you want to; it makes no difference, other than that the code would become less readable.
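For completeness, a stream-based sketch of the same O(n+m) idea, still assuming two List<String>s as in the snippet above; it is the set lookup, not the stream, that does the heavy lifting:
Set<String> set2 = new HashSet<>(list2);              // O(m): build the lookup set once
List<String> listDifferences = list1.stream()         // O(n): one pass over list1
        .filter(s -> !set2.contains(s))               // O(1) lookup per element
        .distinct()                                   // optional: mirrors the set semantics above
        .collect(Collectors.toList());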
Your code actually uses a sub-property (.getValue()), not the object itself. That's no particular problem; just build the sets from the values instead:
var set1 = new HashSet<String>();
for (ObjectValue o1 : list1) set1.add(o1.getValue());   // O(n)
var set2 = new HashSet<String>();
for (ObjectValue o2 : list2) set2.add(o2.getValue());   // O(m)
set1.removeAll(set2);                                   // O(n)
return new ArrayList<String>(set1);                     // O(n)
NB: Sets have O(1) lookup performance as long as their elements have a proper hash distribution. There is no reason to think you'll run into trouble in this regard for this question; String's hashCode distributes just fine.
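If you ever wanted to store the ObjectValue objects themselves in a HashSet instead of their values, they would need sensible equals and hashCode implementations; purely as a hypothetical sketch (assuming value is the only field that matters for identity):
class ObjectValue {
    private final String value;

    ObjectValue(String value) { this.value = value; }

    String getValue() { return value; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ObjectValue)) return false;
        return value.equals(((ObjectValue) o).value);
    }

    @Override
    public int hashCode() {
        return value.hashCode();   // delegates to String's well-distributed hash
    }
}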