Clojure Set vs Map Lookup Performance Differences

I have a list of uids and want to check if a uid is a member of this list

The natural way to implement it would be to create a set (clojure.set) of uids and search for that member on that list

What I found out is that map key lookup is a lot faster - I used the following snippet to benchmark both methods:

(def uids #{:a :b :c :d :e :f :g :h :i :j :k :l :m :n :o :p :a1 :b1 :c1 :d1 :e1 :f1 :h1 :i1 :j1 :k1 :l1 :m1 :n1 :o1 :p1})
(def uids-map (reduce (fn [acc v] (assoc acc v true)) {} uids))
(time (dotimes [i 1000000] (:o1 uids)))
;user=> "Elapsed time: 191.076266 msecs"
(time (dotimes [i 1000000] (:o1 uids-map)))
;user=> "Elapsed time: 38.159388 msecs"

the results were very consistent across invocations - map lookup took about 1/5 of set lookup

So is set not optimal for key lookup or am I using it the wrong way?

Also, what are the reasons for the differences in these benchmarks?

I was under the impression that a set is implemented in clojure as an associative data structure similar to vectors - so why is key lookup significantly slower than a simple map?

Solution

I've never went into clojure's source but from what I see the set implementation actually uses a map inside:

protected APersistentSet(IPersistentMap impl){
    this.impl = impl;
}

It also delegates the invoke call to the internal map.

In APersistentSet:

public Object invoke(Object arg1) {
    return get(arg1);
}

// ....

public Object get(Object key){
    return impl.valAt(key);
}

In APersistentMap:

public Object invoke(Object arg1) {
    return valAt(arg1);
}

public Object invoke(Object arg1, Object notFound) {
    return valAt(arg1, notFound);
}

So this can't explain the difference.

As mentioned in the comments by @cgrand, when we reverse the arguments its faster (and about the same, since we call set's invoke immediately). So I looked up Keyword's invoke which is what is probably used for (:k obj):

final public Object invoke(Object obj, Object notFound) {
    if(obj instanceof ILookup)
        return ((ILookup)obj).valAt(this,notFound);
    return RT.get(obj, this, notFound);
}

The important thing to notice is that ILookup is implemented in APersistentMap (through Associative) but not in APersistentSet. You can also verify in clojure:

(instance? clojure.lang.ILookup #{}) ;; false
(instance? clojure.lang.ILookup {})  ;; true

So maps go through the "happy path" and sets end up in RT.get which I believe is the runtime.

Lets have a look at the runtime.

It Initially attempts to do practically the same thing as keyword:

static public Object get(Object coll, Object key){
    if(coll instanceof ILookup)
        return ((ILookup) coll).valAt(key);
    return getFrom(coll, key);
}

But since we know sets do not implement ILookup we know they go to RT.getFrom:

static Object getFrom(Object coll, Object key){
    if(coll == null)
        return null;
    else if(coll instanceof Map) {
        Map m = (Map) coll;
        return m.get(key);
    }
    else if(coll instanceof IPersistentSet) {
        IPersistentSet set = (IPersistentSet) coll;
        return set.get(key);
    }
    else if(key instanceof Number && (coll instanceof String || coll.getClass().isArray())) {
        int n = ((Number) key).intValue();
        if(n >= 0 && n < count(coll))
            return nth(coll, n);
        return null;
    }
    else if(coll instanceof ITransientSet) {
        ITransientSet set = (ITransientSet) coll;
        return set.get(key);
    }

    return null;
}

Which leads me to believe the main difference is the extra delegations and instanceof calls due to sets not implementing ILookup.

As bonus I've added a test on a set that implements ILookup and delegates valAt to the internal map immediately (using proxy) which closed the gap a bit:

(def uids #{:a :b :c :d :e :f :g :h :i :j :k :l :m :n :o :p :a1 :b1 :c1 :d1 :e1 :f1 :h1 :i1 :j1 :k1 :l1 :m1 :n1 :o1 :p1})
(def uids-map (into {} (for [k uids] [k k])))
(def lookupable-set (proxy [clojure.lang.APersistentSet clojure.lang.ILookup] [uids-map]
                      (valAt [k] (get uids-map k))))

;; verify
(instance? clojure.lang.APersistentSet lookupable-set) ;; true
(instance? clojure.lang.ILookup lookupable-set) ;; true

(time (dotimes [i 1000000] (:o1 uids))) ;; 134.703101 msecs
(time (dotimes [i 1000000] (:o1 lookupable-set))) ;; 63.187353 msecs  <-- faster
(time (dotimes [i 1000000] (:o1 uids-map))) ;; 35.802762 msecs <-- still fastest

To conclude: Where performance matters - invoking the set (#{...} k) without going through keyword (k #{...}) is as fast as map.

But I could be wrong :)