I have a list of uids and want to check if a uid is a member of this list
The natural way to implement it would be to create a set (clojure.set
) of uids and search for that member on that list
What I found out is that map key lookup is a lot faster - I used the following snippet to benchmark both methods:
(def uids #{:a :b :c :d :e :f :g :h :i :j :k :l :m :n :o :p :a1 :b1 :c1 :d1 :e1 :f1 :h1 :i1 :j1 :k1 :l1 :m1 :n1 :o1 :p1})
(def uids-map (reduce (fn [acc v] (assoc acc v true)) {} uids))
(time (dotimes [i 1000000] (:o1 uids)))
;user=> "Elapsed time: 191.076266 msecs"
(time (dotimes [i 1000000] (:o1 uids-map)))
;user=> "Elapsed time: 38.159388 msecs"
the results were very consistent across invocations - map lookup took about 1/5 of set lookup
So is set not optimal for key lookup or am I using it the wrong way?
Also, what are the reasons for the differences in these benchmarks?
I was under the impression that a set is implemented in clojure as an associative data structure similar to vectors - so why is key lookup significantly slower than a simple map?
I've never went into clojure's source but from what I see the set implementation actually uses a map inside:
protected APersistentSet(IPersistentMap impl){
this.impl = impl;
}
It also delegates the invoke
call to the internal map.
In APersistentSet:
public Object invoke(Object arg1) {
return get(arg1);
}
// ....
public Object get(Object key){
return impl.valAt(key);
}
In APersistentMap:
public Object invoke(Object arg1) {
return valAt(arg1);
}
public Object invoke(Object arg1, Object notFound) {
return valAt(arg1, notFound);
}
So this can't explain the difference.
As mentioned in the comments by @cgrand, when we reverse the arguments its faster (and about the same, since we call set's invoke
immediately). So I looked up Keyword's invoke
which is what is probably used for (:k obj)
:
final public Object invoke(Object obj, Object notFound) {
if(obj instanceof ILookup)
return ((ILookup)obj).valAt(this,notFound);
return RT.get(obj, this, notFound);
}
The important thing to notice is that ILookup
is implemented in APersistentMap
(through Associative
) but not in APersistentSet
. You can also verify in clojure:
(instance? clojure.lang.ILookup #{}) ;; false
(instance? clojure.lang.ILookup {}) ;; true
So maps go through the "happy path" and sets end up in RT.get
which I believe is the runtime.
Lets have a look at the runtime.
It Initially attempts to do practically the same thing as keyword:
static public Object get(Object coll, Object key){
if(coll instanceof ILookup)
return ((ILookup) coll).valAt(key);
return getFrom(coll, key);
}
But since we know sets do not implement ILookup
we know they go to RT.getFrom
:
static Object getFrom(Object coll, Object key){
if(coll == null)
return null;
else if(coll instanceof Map) {
Map m = (Map) coll;
return m.get(key);
}
else if(coll instanceof IPersistentSet) {
IPersistentSet set = (IPersistentSet) coll;
return set.get(key);
}
else if(key instanceof Number && (coll instanceof String || coll.getClass().isArray())) {
int n = ((Number) key).intValue();
if(n >= 0 && n < count(coll))
return nth(coll, n);
return null;
}
else if(coll instanceof ITransientSet) {
ITransientSet set = (ITransientSet) coll;
return set.get(key);
}
return null;
}
Which leads me to believe the main difference is the extra delegations and instanceof
calls due to sets not implementing ILookup
.
As bonus I've added a test on a set that implements ILookup
and delegates valAt
to the internal map immediately (using proxy
) which closed the gap a bit:
(def uids #{:a :b :c :d :e :f :g :h :i :j :k :l :m :n :o :p :a1 :b1 :c1 :d1 :e1 :f1 :h1 :i1 :j1 :k1 :l1 :m1 :n1 :o1 :p1})
(def uids-map (into {} (for [k uids] [k k])))
(def lookupable-set (proxy [clojure.lang.APersistentSet clojure.lang.ILookup] [uids-map]
(valAt [k] (get uids-map k))))
;; verify
(instance? clojure.lang.APersistentSet lookupable-set) ;; true
(instance? clojure.lang.ILookup lookupable-set) ;; true
(time (dotimes [i 1000000] (:o1 uids))) ;; 134.703101 msecs
(time (dotimes [i 1000000] (:o1 lookupable-set))) ;; 63.187353 msecs <-- faster
(time (dotimes [i 1000000] (:o1 uids-map))) ;; 35.802762 msecs <-- still fastest
To conclude: Where performance matters - invoking the set (#{...} k)
without going through keyword (k #{...})
is as fast as map.
But I could be wrong :)