Search code examples
vectorfilterclojuresetlisp

Why exactly is filtering using a set more performant than filtering using a vector?


After some researching, I was recently able to dramatically improve the performance of some code by using a set to compare rather than a vector. Here is a simple example of the initial code:

(def target-ids ["a" "b" "c"])

(def maps-to-search-through 
  [{"id": "a" "value": "example"} 
   {"id": "e" "value": "example-2"}])

(filter (fn [i] (some #(= (:id i) %) target-ids)) maps-to-search-through)

And here is the optimised code:

(def target-ids #{"a" "b" "c"})

(def maps-to-search-through
  [{"id": "a" "value": "example"} 
   {"id": "e" "value": "example-2"}])

(filter (comp target-ids :id) maps-to-search-through)

For reference, target-ids and maps-to-search-through are both generated dynamically, and can contain thousands of values each -- although maps-to-search-through will always be at least 5x larger than target-ids.

All advice and documentation I found online suggested this improvement, specifically using a set instead of a vector, would be significantly faster, but didn't elaborate on why that is. I understand that in the initial case, filter is doing a lot of work - iterating through both vectors on every step. But I don't understand how that isn't the case in the improved code.

Can anyone help explain?


Solution

  • Sets are data structures that are designed to only contain unique values. Also you can use them as functions to check whether a given value is a member of the very set - just as you use your target-ids set. It basically boils down to a call of Set.contains on JVM side which uses some clever hash-based logic.

    Your first solution loops through the vector using some, so it's similar to a nested for loop which is obviously slower.