solr search-engine elasticsearch linkedin-api

Partial Update of documents


We have a requirement that documents we currently index in Solr may periodically need to be partially updated. An update can either (a) add new fields or (b) update the content of existing fields. Some of the fields in our schema are stored; others are not.

Solr 4 does allow this, but all the fields must be stored. See "Update a new field to existing document" and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/

Questions:

  1. Is there a way that Solr can achieve this when not all fields are stored? We've tried Solr joins in the past, but they weren't the right fit for all our use cases.

  2. Alternatively, can Elasticsearch, LinkedIn's SenseiDB, or other text search engines achieve this?

For now, we manage by re-indexing the affected documents whenever they need to be updated.

Thanks


Solution

  • Solr has the limitation of stored fields, that's correct. The underlying Lucene library always has to delete the old document and index the new one. In fact, Lucene segments are write-once: Lucene never goes back to modify existing segments, so it only marks documents as deleted and physically removes them when a merge happens.

    Search servers built on top of Lucene work around this problem by exposing a single endpoint that deletes the old document and reindexes the new one automatically, but they need some way to retrieve the old document first. Solr can do that only if you store all the fields.
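As an illustration of the Solr 4 partial (atomic) update described above, each changed field is wrapped in a modifier such as set, add, or inc. The following is only a minimal sketch of building such a request body: the document id and the field names (price, tags, views) are hypothetical, and it assumes the schema's unique key field is called id and that all fields are stored.

```python
import json


def solr_atomic_update(doc_id, set_fields=None, add_fields=None, inc_fields=None):
    """Build the JSON body for a Solr atomic update of a single document.

    The field names are whatever the caller's schema defines; the ones used
    in the example call below are hypothetical.
    """
    doc = {"id": doc_id}  # assumption: the uniqueKey field is named "id"
    modifiers = (("set", set_fields), ("add", add_fields), ("inc", inc_fields))
    for modifier, fields in modifiers:
        for name, value in (fields or {}).items():
            # Each updated field becomes {"<modifier>": <value>}
            doc[name] = {modifier: value}
    # Solr's JSON update format accepts a list of documents
    return json.dumps([doc])


body = solr_atomic_update(
    "doc-42",
    set_fields={"price": 19.99},   # replace the current value
    add_fields={"tags": "sale"},   # append to a multi-valued field
    inc_fields={"views": 1},       # increment a numeric field
)
print(body)
```

The resulting body would be sent as JSON to the update handler of your Solr core; the exact URL depends on your installation.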

    Elasticsearch works around it by storing the source document by default, in a special field called _source. That is exactly the document that you sent to the search engine in the first place, at index time. This is, by the way, one of the features that make Elasticsearch similar to NoSQL databases. The Elasticsearch Update API allows you to update a document in two ways:

    1. Sending a new partial document that will be merged with the existing one (still deleting the old one and indexing the result of the merge)
    2. Executing a script on the existing document and indexing the result after deleting the old one

    Both options rely on the presence of the _source field. Storing the source can be disabled, but if you disable it you of course lose this feature.
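The two Update API options above can be sketched as request bodies. This is a hedged illustration rather than a definitive recipe: the field names (category, views) are hypothetical, and the script layout shown (a "script" object with "source", "lang", and "params") follows recent Elasticsearch releases that use the Painless language. Older releases used a different shape and scripting language, so check the documentation for your version.

```python
import json


def partial_doc_update(fields):
    """Option 1: a partial document that is merged into the stored _source."""
    return {"doc": dict(fields)}


def scripted_update(source, params=None):
    """Option 2: a script that modifies the existing document via ctx._source."""
    return {
        "script": {
            "source": source,
            "lang": "painless",  # assumption: a release where Painless is available
            "params": dict(params or {}),
        }
    }


# Hypothetical updates: add a new field, and increment an existing counter
add_field = partial_doc_update({"category": "books"})
bump_views = scripted_update("ctx._source.views += params.n", {"n": 1})

print(json.dumps(add_field))
print(json.dumps(bump_views))
```

Either body would be sent to the update endpoint for a single document id. The URL layout differs between Elasticsearch versions, so it is left out here.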