I'd like to know how DSpace manages indexing in both the database and Solr while supporting concurrency. In other words, if two people write to the same item at the same time (e.g. changing metadata), how does DSpace ensure that the index does not become desynchronized from the database?
This can happen if USER1 and USER2 write concurrently to the same metadata value: USER1's write to the database happens first, then USER2's writes to the database and to the index happen, and only then does USER1's write to the index happen.
In other words, USER1's write ends up in the index while USER2's write ends up in the database = inconsistency!
I wonder how this case, which is a typical dual-write problem, can be avoided in DSpace.
With DSpace's Event System, I don't see how this can be avoided.
Does anyone know?
In Solr, DSpace doesn't index just the single metadata change (when it occurs). It actually reindexes the entire Item in Solr.
What this means is that while concurrency is an issue in the Database layer (and writes/updates are synchronized in the database), it is not one in the Solr indexing process.
Here's what would/should happen in your example.
So, the simple answer here is that DSpace doesn't reindex individual modifications (which could end up out of order if not synchronized with the DB edits). Instead, it tracks which objects have been updated and triggers a reindex of the entire object's metadata. While this may seem like "overkill", the reindex of a single object in Solr is not all that process intensive, and it ensures that the object's current/latest metadata is indexed in Solr (in the case of simultaneous writes).
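To make the difference concrete, here is a minimal toy sketch in Python. It is not DSpace or Solr code: the "database" and "index" are plain dicts, and the names run, db_write, index_delta and index_whole_item are hypothetical. It replays the interleaving from the question and assumes each whole-item reindex reads the database state at the moment it runs.

```python
# Toy model: the "database" and the "index" are plain dicts, not DSpace or Solr.
def run(index_strategy):
    db = {}      # item_id -> metadata dict (source of truth)
    index = {}   # item_id -> metadata dict (search copy)

    def db_write(item_id, field, value):
        db.setdefault(item_id, {})[field] = value

    def index_delta(item_id, field, value):
        # Push only the changed value; the result depends on arrival order.
        index.setdefault(item_id, {})[field] = value

    def index_whole_item(item_id, field, value):
        # Ignore the event payload and copy the item's current DB state.
        index[item_id] = dict(db[item_id])

    do_index = index_delta if index_strategy == "delta" else index_whole_item

    # Interleaving described in the question.
    db_write("item-1", "title", "from USER1")
    db_write("item-1", "title", "from USER2")
    do_index("item-1", "title", "from USER2")
    do_index("item-1", "title", "from USER1")  # USER1's index step arrives late
    return db, index


db, index = run("delta")
print(db == index)   # False: the index holds USER1's value, the DB holds USER2's

db, index = run("whole")
print(db == index)   # True: both index steps copied the latest DB state
```

In this model the late index step from USER1 is harmless under the whole-item strategy, because it indexes whatever the database holds at that point rather than the value USER1 submitted.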
UPDATE: As requested (in comments below), here's how DSpace performs reindexing (in Solr) in much more detail.
1. The Event System is configured in dspace.cfg, in this section: https://github.com/DSpace/DSpace/blob/dspace-5_x/dspace/config/dspace.cfg#L732
2. IndexEventConsumer is what performs indexing for Solr. It is configured by default here: https://github.com/DSpace/DSpace/blob/dspace-5_x/dspace/config/dspace.cfg#L732
3. When an Item is changed, the Item.update() method is called to actually save the changes back to the database layer.
4. After the save to the database (via DatabaseManager.update()), the Item.update() method generates a new MODIFY event in the Event System.
5. The event is then dispatched (BasicDispatcher is configured by default in dspace.cfg), which in turn triggers the index in Solr (via the configured IndexEventConsumer).
6. IndexEventConsumer passes the list of updated objects (in this case an Item) to the IndexingService (SolrServiceImpl by default).
7. SolrServiceImpl.indexContent() reads the latest metadata value(s) from the database and indexes them in Solr.
The above logic is still a bit simplified (it would be far too complex to walk through every step of the code). But the basic gist is that each Item.update() call is treated as a database transaction. It also triggers the addition of a MODIFY event, which is stored in the user's session (Context object). As soon as the DB transaction is committed, the MODIFY event is processed by the IndexEventConsumer, which reindexes the entire Item.
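The flow above can be sketched as a small Python model. This is only an illustration under simplifying assumptions, not the DSpace implementation: the classes Database, Index, IndexConsumer and Context are hypothetical stand-ins, writes are buffered until commit() for simplicity, and each commit runs to completion before the next one starts.

```python
class Database:
    """Stand-in for the DB layer: committed item metadata by id."""
    def __init__(self):
        self.items = {}


class Index:
    """Stand-in for Solr: one document per item id."""
    def __init__(self):
        self.docs = {}
        self.reindex_count = 0

    def index_content(self, db, item_id):
        # Read the latest committed metadata and replace the whole document.
        self.docs[item_id] = dict(db.items[item_id])
        self.reindex_count += 1


class IndexConsumer:
    """Collects ids of modified objects, then reindexes each one once."""
    def __init__(self, db, index):
        self.db, self.index = db, index
        self.pending = set()

    def consume(self, event):
        kind, item_id = event
        if kind == "MODIFY":
            self.pending.add(item_id)

    def end(self):
        for item_id in sorted(self.pending):
            self.index.index_content(self.db, item_id)
        self.pending.clear()


class Context:
    """Per-user session: buffers writes and events until commit."""
    def __init__(self, db, consumers):
        self.db, self.consumers = db, consumers
        self.writes, self.events = [], []

    def update_item(self, item_id, field, value):
        self.writes.append((item_id, field, value))
        self.events.append(("MODIFY", item_id))

    def commit(self):
        # 1. Apply the buffered writes to the database.
        for item_id, field, value in self.writes:
            self.db.items.setdefault(item_id, {})[field] = value
        # 2. Only then dispatch the queued events to the consumers.
        for consumer in self.consumers:
            for event in self.events:
                consumer.consume(event)
            consumer.end()
        self.writes, self.events = [], []


db, index = Database(), Index()
consumer = IndexConsumer(db, index)
user1, user2 = Context(db, [consumer]), Context(db, [consumer])

user1.update_item("item-1", "title", "from USER1")
user2.update_item("item-1", "title", "from USER2")
user1.commit()
user2.commit()

print(index.docs["item-1"] == db.items["item-1"])  # True
```

The design point this model tries to show is that a MODIFY event carries only the identity of the changed object, not the new value, so the consumer always has to go back to the database for the content it indexes.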
So, in the case of simultaneous edits, two MODIFY events will be generated (one for each edit). However, the last MODIFY event will not be triggered until after the last database edit is committed. Therefore, the Solr index should always be in sync with the latest info in the Database.