Tags: clojure, datomic, datalog

Is it suboptimal to add the same datoms multiple times?


I'm currently using Datomic in one of my projects, and a question is bothering me.

Here is a simplified version of my problem:

  • I need to parse a list of small English sentences and insert both the full sentence and its words into Datomic.
  • the file that contains the list of sentences is quite big (> 10 GB)
  • the same sentence can occur multiple times in the file, and its words can also occur multiple times across sentences
  • during the insertion process, an attribute will be set to associate each sentence with its corresponding words

To ease the insertion process, I'm tempted to write the same datoms multiple times (i.e. not check whether a record already exists in the database). But I'm worried about the performance impact.

  • What happens in Datomic when the same datoms are added multiple times?
  • Is it worth checking that a datom has already been added prior to the transaction?

  • Is there a way to prevent Datomic from overriding previous datoms (i.e. if a record already exists, skip the transaction)?

Thank you for your help


Solution

    • What happens in Datomic when the same datoms are added multiple times?
    • Is it worth checking that a datom has already been added prior to the transaction?

    Logically, a Datomic database is a sorted set of datoms, so adding the same datom several times is idempotent. However, when you assert a datom with a tempid, you may create a new datom representing the same information as an old one. This is where :db/unique comes in.

    To ensure an entity does not get stored several times, you want to set the :db/unique attribute property to :db.unique/identity for the right attributes. For instance, if your schema consists of 3 attributes :word/text, :sentence/text, and :sentence/words, then :word/text and :sentence/text should be :db.unique/identity, which yields the following schema installation transaction:

    [{:db/cardinality :db.cardinality/one,
      :db/fulltext true,
      :db/index true,
      :db.install/_attribute :db.part/db,
      :db/id #db/id[:db.part/db -1000777],
      :db/ident :sentence/text,
      :db/valueType :db.type/string,
      :db/unique :db.unique/identity}
     {:db/cardinality :db.cardinality/one,
      :db/fulltext true,
      :db/index true,
      :db.install/_attribute :db.part/db,
      :db/id #db/id[:db.part/db -1000778],
      :db/ident :word/text,
      :db/valueType :db.type/string,
      :db/unique :db.unique/identity}
     {:db/cardinality :db.cardinality/many,
      :db/fulltext true,
      :db/index true,
      :db.install/_attribute :db.part/db,
      :db/id #db/id[:db.part/db -1000779],
      :db/ident :sentence/words,
      :db/valueType :db.type/ref}]
    

    Then the transaction for inserting a sentence looks like:

    [{:sentence/text "Hello World!"
      :sentence/words [{:word/text "hello"
                        :db/id (d/tempid :db.part/user)}
                       {:word/text "world"
                        :db/id (d/tempid :db.part/user)}]
      :db/id (d/tempid :db.part/user)}]
    
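    Because :word/text and :sentence/text are :db.unique/identity, re-running this kind of transaction is safe: the tempids resolve to the existing entities and no duplicate datoms are created. Here is a minimal sketch of that upsert behavior (the function name and the `conn` connection are hypothetical; it assumes the schema above has already been transacted):

    ```clojure
    (require '[datomic.api :as d])

    ;; Assumes `conn` is a connection whose db has the schema above installed.
    (defn insert-sentence! [conn text words]
      @(d/transact conn
         [{:db/id (d/tempid :db.part/user)
           :sentence/text text
           :sentence/words (for [w words]
                             {:db/id (d/tempid :db.part/user)
                              :word/text w})}]))

    ;; Transacting the same sentence twice upserts instead of duplicating:
    (insert-sentence! conn "Hello World!" ["hello" "world"])
    (insert-sentence! conn "Hello World!" ["hello" "world"])
    ;; the db still holds exactly one entity with :sentence/text "Hello World!"
    ```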

    Regarding performance:

    You may not need to optimize at all, but in my view, the potential performance bottlenecks of your import process are:

    1. time spent building the transaction in the Transactor (which includes index lookups for unique attributes etc.)
    2. time spent building the indexes.

    To improve 2.: When the data you insert is sorted, indexing is faster, so one optimization would be to insert the words and sentences in sorted order. You can use Unix tools to sort large files even when they don't fit in memory. So the process would be:

    • sort sentences, insert them (:sentence/text)
    • extract words, sort them, insert them (:word/text)
    • insert word-sentence relationship (:sentence/words)
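    The first two steps above can be sketched with standard Unix tools (the file names are hypothetical; `sort` spills to temporary files, so it handles inputs larger than memory):

    ```shell
    # Hypothetical input: one sentence per line.
    printf 'the cat sat\nhello world\nthe cat sat\n' > sentences.txt

    # Deduplicate and sort the sentences before the :sentence/text import.
    sort -u sentences.txt > sentences.sorted.txt

    # Extract the words, then deduplicate and sort them for the :word/text import.
    tr ' ' '\n' < sentences.sorted.txt | sort -u > words.sorted.txt

    cat words.sorted.txt
    ```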

    To improve 1.: indeed, it puts less pressure on the Transactor to reference already-stored words by entity id rather than by their text (which requires an index lookup to enforce uniqueness). One idea is to perform that lookup on the Peer instead, either in parallel and/or only for frequent words (for instance, you could insert the words from the first 1,000 sentences, then retrieve their entity ids and keep them in a hash map).
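    That Peer-side cache could look like the following sketch (function names are hypothetical; it assumes the schema above and a db value from an initial batch of inserts):

    ```clojure
    (require '[datomic.api :as d])

    ;; Build a word -> entity-id map on the Peer from the words already stored.
    (defn word-eids [db]
      (into {} (d/q '[:find ?text ?w
                      :where [?w :word/text ?text]]
                    db)))

    ;; During the bulk import, reuse the cached entity id when we have one,
    ;; and fall back to an upsert on :word/text otherwise.
    (defn word-ref [cache text]
      (or (get cache text)
          {:db/id (d/tempid :db.part/user) :word/text text}))
    ```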

    Personally, I would not go through these optimizations until experience has shown they're necessary.