After some researching, I was recently able to dramatically improve the performance of some code by using a set to compare rather than a vector. Here is a simple example of the initial code:
(def target-ids ["a" "b" "c"])
(def maps-to-search-through
[{"id": "a" "value": "example"}
{"id": "e" "value": "example-2"}])
(filter (fn [i] (some #(= (:id i) %) target-ids)) maps-to-search-through)
And here is the optimised code:
(def target-ids #{"a" "b" "c"})
(def maps-to-search-through
[{"id": "a" "value": "example"}
{"id": "e" "value": "example-2"}])
(filter (comp target-ids :id) maps-to-search-through)
For reference, target-ids
and maps-to-search-through
are both generated dynamically, and can contain thousands of values each -- although maps-to-search-through
will always be at least 5x larger than target-ids
.
All advice and documentation I found online suggested this improvement, specifically using a set instead of a vector, would be significantly faster, but didn't elaborate on why that is. I understand that in the initial case, filter
is doing a lot of work - iterating through both vectors on every step. But I don't understand how that isn't the case in the improved code.
Can anyone help explain?
Sets are data structures that are designed to only contain unique values. Also you can use them as functions to check whether a given value is a member of the very set - just as you use your target-ids
set. It basically boils down to a call of Set.contains
on JVM side which uses some clever hash-based logic.
Your first solution loops through the vector using some
, so it's similar to a nested for
loop which is obviously slower.