I'm currently evaluating Datomic for the use-case of storing and querying parsed symbols that form an ontology. In total there are 225122 symbols (entities) in the database (so it's a rather big ontology, but shouldn't be a big deal for a DB).
The structure is pretty standard, symbols have
To have nice access to the symbols, we have a unique name
for each symbol. This adds up to the following Datomic schema:
[{:db/ident :ml/name,
:db/valueType :db.type/string,
:db/cardinality :db.cardinality/one,
:db/unique :db.unique/identity}
{:db/ident :ml/parent,
:db/valueType :db.type/ref,
:db/index true,
:db/cardinality :db.cardinality/one}
{:db/ident :ml/superclass,
:db/valueType :db.type/ref,
:db/index true,
:db/cardinality :db.cardinality/one}]
Now I have the most basic recursive query "give me all symbols that are (transitively) contained in symbol p
". In Datomic terms:
(def rules
'[
[(ubersymbol ?c ?p) (?c :ml/parent ?p)]
[(ubersymbol ?c ?p) (?c :ml/parent ?c1) (ubersymbol ?c1 ?p) ]
])
(q '[:find ?c ?n :in $ % :where
(ubersymbol ?c ?d) [?d :ml/name "name of a root symbol"] [?c :ml/name ?n]]
current-db rules)
The query itself (so a mid-sized symbol) takes between 5 and 5.5 seconds and returns 80 hits. Not milliseconds, but real seconds. And this is only the most basic query I want to ask about the dataset (it's intended to be used from a web-tool to help modellers understand the structure of the ontology).
I'm running datomic-pro-0.9.5554
, with a memory database and use the peer library (I started the server as described in the "getting started" guide.
Help is really appreciated to make a case for Datomic.
markus
As fricke
discovered by himself, it was a problem of clause ordering, but in the query, not in the ruleset. A more efficient version would be:
[:find ?c ?n :in $ % :where
[?d :ml/name "name of a root symbol"]
(ubersymbol ?c ?d)
[?c :ml/name ?n]]
The above query can be further improved by:
:ml/name
which yields:
(d/q
'[:find ?c ?n :in % $ ?d :where
(ubersymbol ?c ?d)
[?c :ml/name ?n]]
rules current-db [:ml/name "name of a root symbol"])
My theory is that your rules are not written in a way that Datalog can optimize for this read pattern - probably resulting in a traversal of all the entities. I suggest to rewrite them as follows:
[[(ubersymbol ?c ?p)
(?c :ml/parent ?p)]
[(ubersymbol ?c ?p)
;; we bind a child of the ancestor, instead of a parent of the descendant
(?c1 :ml/parent ?p)
(ubersymbol ?c ?c1)]]
This way of writing the ruleset is optimized to find the descendants of some node. The way you originally wrote it is optimized to find the anscestors of some node.
A quick benchmark on my machine with Datomic 0.9.5385 on a balanced binary tree of 50000 entities showed that you do get the desired performance with the second approach.