
Efficiency of delta import in Solr


I have about 2,100,000 rows of data. A full import takes about 2 minutes. To index updates to the table I use delta import, and that takes 6 minutes.

From an efficiency standpoint, a full import looks better than a delta import. So what is the point of delta import? Is there a better way to configure delta import to improve its efficiency?

I followed the steps in the documentation.

data-config.xml

<dataConfig>
<dataSource type="JdbcDataSource"
            driver="com.dbschema.CassandraJdbcDriver"
            url="jdbc:cassandra://127.0.0.1:9042/test"
            autoCommit="true" rowLimit="-1" batchSize="-1"/>
<document name="content">
    <entity name="test"
            query="SELECT * from person"
            deltaImportQuery="select * from person where seq=${dataimporter.delta.seq}"
            deltaQuery="select seq from person where last_modified &gt; '${dataimporter.last_index_time}' ALLOW FILTERING"
            autoCommit="true">
        <field column="seq" name="id" />
        <field column="last" name="last_s" />
        <field column="first" name="first_s" />
        <field column="city" name="city_s" />
        <field column="zip" name="zip_s" />
        <field column="street" name="street_s" />
        <field column="age" name="age_s" />
        <field column="state" name="state_s" />
        <field column="dollar" name="dollar_s" />
        <field column="pick" name="pick_s" />
    </entity>
</document>
</dataConfig>


Solution

  • The usual way of setting up delta indexing (as you did) runs two queries instead of one: deltaQuery finds the changed keys, and deltaImportQuery is then run once for each changed row. So in some cases it is not optimal.

    I prefer to set up the delta so that there is a single query to maintain: it is cleaner, and the delta runs in a single query. You should try it; it might improve things. The downside is deletes: you either do soft deletes or you still need the usual delta configuration for them (I favour the former).
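
    A minimal sketch of what a single-query setup could look like against the table from the question. The same query serves both full and delta runs, and the cutoff arrives as a request parameter, which DIH exposes as ${dataimporter.request.<name>}. The parameter name since is made up for illustration. On SQL databases the commonly documented form instead combines a ${dataimporter.request.clean} check with the last_modified filter using OR, but CQL has no OR, hence the parameter here.

    ```xml
    <dataConfig>
      <dataSource type="JdbcDataSource"
                  driver="com.dbschema.CassandraJdbcDriver"
                  url="jdbc:cassandra://127.0.0.1:9042/test"/>
      <document name="content">
        <!-- One query for both full and delta runs.
             "since" is a hypothetical request parameter supplied by the caller;
             if it is omitted the placeholder is left empty and the query
             will most likely fail. -->
        <entity name="test"
                query="SELECT * FROM person WHERE last_modified &gt; '${dataimporter.request.since}' ALLOW FILTERING">
          <field column="seq" name="id" />
          <field column="last" name="last_s" />
          <!-- remaining field mappings as in the question -->
        </entity>
      </document>
    </dataConfig>
    ```

    Both kinds of run would then use command=full-import: a rebuild with clean=true and a since value early enough to match every row, and a delta with clean=false and since set to the time of the previous run. The clean=false part matters, because full-import otherwise clears the existing index first.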

    Also, of course, make sure the last_modified column is properly indexed. I am not familiar with the Cassandra JDBC driver, so you should double-check.

    One last thing: if you are using DataStax Enterprise, you can query it via Solr if it is configured for that. In that case you could also try indexing via SolrEntityProcessor; with a request-parameter trick you can do both full and delta indexing. I have used this successfully in the past.
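
    The SolrEntityProcessor idea could be sketched roughly as follows, assuming the table is exposed as a Solr core and that last_modified is an indexed date field there. The URL, the core name, and the since request parameter are illustrative assumptions, not values from the question.

    ```xml
    <dataConfig>
      <document>
        <!-- Pulls documents from another Solr instance instead of going through JDBC.
             The url and the "since" request parameter are hypothetical. -->
        <entity name="person"
                processor="SolrEntityProcessor"
                url="http://localhost:8983/solr/test.person"
                query="last_modified:[${dataimporter.request.since} TO *]"
                rows="1000"
                fl="*" />
      </document>
    </dataConfig>
    ```

    Passing since=* makes the range match every document that has a last_modified value (a full run). Passing a timestamp such as 2020-01-01T00:00:00Z restricts it to newer documents (a delta run, together with clean=false).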