Tags: solr, in-place

How can I do a Solr Delta Import without updating the whole document?


I want to do a Solr Delta Import, but I don't want to update the whole document. Is there a way I can instruct Solr to update only certain fields when doing the delta import?


Solution

  • Theory

    This feature is known as an in-place update. An in-place update is performed only when the field to be updated meets all of these conditions:

    • non-indexed (indexed="false")
    • non-stored (stored="false")
    • single valued (multiValued="false")
    • numeric docValues (docValues="true")

    In other words, this feature is based on a special data structure, DocValues, so you cannot update a non-DocValues field without reindexing the whole document. You can read more about updatable DocValues in the related Lucene/Solr JIRA issues.
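
    For illustration, here is a sketch of what these conditions mean at the schema level (the field names are just examples): the first field qualifies for in-place updates, while the second does not, because it is indexed and stored.

    <!-- eligible: non-indexed, non-stored, single-valued, numeric docValues -->
    <field name="view_count" type="int" indexed="false" stored="false" multiValued="false" docValues="true"/>

    <!-- not eligible: indexed and stored, so an update of it is handled as a full atomic update -->
    <field name="title" type="string" indexed="true" stored="true"/>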

  • Practice

    Here is an example via SolrJ:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // point the client at the target core/collection, e.g. "library"
    HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/library").build();
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "1");
    Map<String, Object> fields = new HashMap<>();
    fields.put("inc", -1);   // in-place modifier: decrement count by 1
    doc.addField("count", fields);
    client.add(doc);
    client.commit();         // make the change visible
    client.close();
    

    Or via CURL:

    curl -H 'Content-Type: application/json' 'http://localhost:8983/solr/library/update?commit=true' -d '
    [
     {"id"    : "1",
      "count" : {"inc": -1}
     }
    ]'
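
    To check the result you can ask for the field in the field list; even though count is not stored, its docValues value is returned (assuming the default useDocValuesAsStored behaviour of recent schema versions):

    curl 'http://localhost:8983/solr/library/select?q=id:1&fl=id,count'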
    

    Where the count field is declared as:

    <field name="count" type="int" indexed="false" stored="false" docValues="true"/>
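
    The int type itself needs to be a numeric, docValues-capable type; depending on the Solr version, its definition might look like one of the following (the class names are an assumption about your setup):

    <!-- Solr 7 and later: point-based numerics -->
    <fieldType name="int" class="solr.IntPointField" docValues="true"/>

    <!-- older versions: Trie-based numerics -->
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" docValues="true"/>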
    

    Please note that if the field does not meet these conditions, an "Atomic Update" will be applied instead.

    "Atomic Updates"

    You can "update" any field in document without any restrictions by "Atomic Updates". Atomic Update does not actually do in-place update - it deletes the old document and then indexes a new document with the update applied to it in one shot. Under the hood it requires that all fields in your schema must be configured as stored and copy fields as not stored(keep in mind nested documents) and tries to reconstruct the whole document from the stored fields. In case of any misconfiguration you will lost a huge part of document without any notification. In general atomic update has the following drawbacks:

    • Reindexing entire documents and passing them through all analysis chains consumes a lot of CPU cycles
    • Index size is increased, because the original document data has to be stored
    • New index segments are created and old documents are marked as deleted in existing segments, causing segment merge policies to kick in and use additional CPU and build up I/O pressure
    • Most importantly, commits that make the changes visible force searchers to be reopened, which wipes the accumulated filter, field, document, and query result caches as new segments are added to the index
    • In the case of a block (nested document) index structure, whole blocks of documents have to be reindexed, significantly increasing overhead
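
    As a contrast to the in-place example above, here is a minimal sketch of an atomic update via SolrJ, assuming a stored field named title (the field name and value are hypothetical); the "set" modifier replaces the value, and Solr rewrites the whole document behind the scenes:

    import java.util.Collections;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/library").build();
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "1");
    // "set" is an atomic update modifier; "title" is a hypothetical stored field
    doc.addField("title", Collections.singletonMap("set", "New title"));
    client.add(doc);
    client.commit();
    client.close();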