Why exactly is filtering using a set more performant than filtering using a vector?

After some researching, I was recently able to dramatically improve the performance of some code by using a set to compare rather than a vector. Here is a simple example of the initial code:

(def target-ids ["a" "b" "c"])

(def maps-to-search-through 
  [{"id": "a" "value": "example"} 
   {"id": "e" "value": "example-2"}])

(filter (fn [i] (some #(= (:id i) %) target-ids)) maps-to-search-through)

And here is the optimised code:

(def target-ids #{"a" "b" "c"})

(def maps-to-search-through
  [{"id": "a" "value": "example"} 
   {"id": "e" "value": "example-2"}])

(filter (comp target-ids :id) maps-to-search-through)

For reference, target-ids and maps-to-search-through are both generated dynamically, and can contain thousands of values each -- although maps-to-search-through will always be at least 5x larger than target-ids.

All advice and documentation I found online suggested this improvement, specifically using a set instead of a vector, would be significantly faster, but didn't elaborate on why that is. I understand that in the initial case, filter is doing a lot of work - iterating through both vectors on every step. But I don't understand how that isn't the case in the improved code.

Can anyone help explain?

Solution

Sets are data structures that are designed to only contain unique values. Also you can use them as functions to check whether a given value is a member of the very set - just as you use your target-ids set. It basically boils down to a call of Set.contains on JVM side which uses some clever hash-based logic.

Your first solution loops through the vector using some, so it's similar to a nested for loop which is obviously slower.