I'm currently using Datomic in one of my projects, and a question is bothering me.
Here is a simplified version of my problem:
To ease the insertion process, I'm tempted to write the same datoms multiple times (i.e. not checking whether a record already exists in the database). But I'm worried about the performance impact.
Is it worth checking that a datom has already been added prior to the transaction?
Is there a way to prevent Datomic from overriding previous datoms (i.e. if a record already exists, skip the transaction)?
Thank you for your help
- What happens in Datomic when the same datoms are added multiple times?
- Is it worth checking that a datom has already been added prior to the transaction?
Logically, a Datomic database is a sorted set of datoms, so adding the same datom several times is idempotent. However, when you assert a datom with a tempid, you may create a new datom representing the same information as an old one. This is where :db/unique comes in.

To ensure an entity does not get stored several times, you want to set the :db/unique attribute property to :db.unique/identity for the right attributes. For instance, if your schema consists of 3 attributes :word/text, :sentence/text, and :sentence/words, then :word/text and :sentence/text should be :db.unique/identity, which yields the following schema installation transaction:
[{:db/id                 #db/id[:db.part/db -1000777]
  :db/ident              :sentence/text
  :db/valueType          :db.type/string
  :db/cardinality        :db.cardinality/one
  :db/fulltext           true
  :db/index              true
  :db/unique             :db.unique/identity
  :db.install/_attribute :db.part/db}
 {:db/id                 #db/id[:db.part/db -1000778]
  :db/ident              :word/text
  :db/valueType          :db.type/string
  :db/cardinality        :db.cardinality/one
  :db/fulltext           true
  :db/index              true
  :db/unique             :db.unique/identity
  :db.install/_attribute :db.part/db}
 ;; :db/fulltext applies only to string attributes, so it is not set here
 {:db/id                 #db/id[:db.part/db -1000779]
  :db/ident              :sentence/words
  :db/valueType          :db.type/ref
  :db/cardinality        :db.cardinality/many
  :db/index              true
  :db.install/_attribute :db.part/db}]
Then the transaction for inserting a sentence looks like:
[{:sentence/text  "Hello World!"
  :sentence/words [{:word/text "hello"
                    :db/id     (d/tempid :db.part/user)}
                   {:word/text "world"
                    :db/id     (d/tempid :db.part/user)}]
  :db/id          (d/tempid :db.part/user)}]
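To make the upsert behaviour concrete, here is a minimal sketch of running that transaction twice, assuming the schema above has been installed and conn is a Peer connection (conn and tx-data are illustrative names, not from the original question):

(require '[datomic.api :as d])

(def tx-data
  [{:sentence/text  "Hello World!"
    :sentence/words [{:word/text "hello"
                      :db/id     (d/tempid :db.part/user)}
                     {:word/text "world"
                      :db/id     (d/tempid :db.part/user)}]
    :db/id          (d/tempid :db.part/user)}])

;; Because :sentence/text and :word/text are :db.unique/identity, the second
;; transaction upserts to the same entities and adds no new domain datoms.
@(d/transact conn tx-data)
@(d/transact conn tx-data)

;; Still exactly one entity for the word "hello":
(d/q '[:find (count ?w) .
       :where [?w :word/text "hello"]]
     (d/db conn))
;; => 1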
You may not need to optimize at all, but in my view, the potential performance bottlenecks of your import process are:

1. the load on the transactor from resolving each unique value (word or sentence text) to an entity id, which requires an index lookup;
2. the time spent indexing the inserted data.
To improve 2.: when the data you insert is sorted, indexing is faster, so an option would be to insert the words and sentences in sorted order. You can use Unix tools to sort large files even if they don't fit in memory. The process would then be:

- sort the sentences (by :sentence/text)
- sort the words (by :word/text)
- insert the sorted words, then the sorted sentences (linking them via :sentence/words)
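If the word list fits in memory, a batched import of the sorted words could look like the sketch below; conn and words are assumed names and the batch size of 1000 is arbitrary. For files too large for memory, pre-sort them with Unix sort and stream them lazily instead.

(require '[datomic.api :as d])

(defn word-tx [text]
  {:db/id     (d/tempid :db.part/user)
   :word/text text})

;; Insert the distinct words in sorted order, in modest batches.
(doseq [batch (partition-all 1000 (sort (distinct words)))]
  @(d/transact conn (map word-tx batch)))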
To improve 1.: indeed, it would put less pressure on the transactor to use entity ids for words that are already stored, rather than the whole word text (which requires an index lookup to ensure uniqueness). One idea is to perform that lookup on the Peer instead, either by leveraging parallelism and/or only doing it for frequent words. For instance, you could insert the words from the first 1000 sentences, then retrieve their entity ids and keep them in a hash map.
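A rough sketch of that Peer-side lookup, assuming a connection conn; word->eid, word-ref, and sentence-tx are illustrative names, and here the map is built for all stored words rather than only the frequent ones:

(require '[datomic.api :as d])

;; Map from word text to entity id, built once on the Peer.
(def word->eid
  (into {}
        (d/q '[:find ?text ?w
               :where [?w :word/text ?text]]
             (d/db conn))))

(defn word-ref [text]
  ;; Use the cached entity id when we have one; otherwise fall back to an
  ;; upsert on :word/text.
  (or (word->eid text)
      {:db/id (d/tempid :db.part/user) :word/text text}))

(defn sentence-tx [sentence-text word-texts]
  {:db/id          (d/tempid :db.part/user)
   :sentence/text  sentence-text
   ;; distinct, so the same new word is not asserted twice in one transaction
   :sentence/words (mapv word-ref (distinct word-texts))})

;; e.g. @(d/transact conn [(sentence-tx "Hello World!" ["hello" "world"])])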
Personally, I would not go through these optimizations until experience has shown they're necessary.