I am using Solr's DataImportHandler to import data from a database. Some records contain an empty string where a column has no value.
Currently the configuration I have produces Solr documents like this:
{
  "x": "value",
  "y": "",
  "z": 2
}
However, I would like to ignore all fields that have no value, so that documents like this are created:
{
  "x": "value",
  "z": 2
}
Is there something I can define in the configuration file for the DataImportHandler that will give me my desired results?
A little-known feature of Solr is that you can plug in an UpdateRequestProcessor (URP) chain to run after the DIH, and there are URPs designed specifically for this problem.
So you could do something like this:
<updateRequestProcessorChain name="skip-empty">
  <!-- The next two processors use their default configuration, so they affect all fields -->
  <!-- Trim leading/trailing spaces. This also empties all-spaces fields for the next processor -->
  <processor class="solr.TrimFieldUpdateProcessorFactory" />
  <!-- Delete fields with no content. More efficient, and lets you query for the presence/absence of a field -->
  <processor class="solr.RemoveBlankFieldUpdateProcessorFactory" />
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Remember to also reference this chain in the DIH request handler's definition:
<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    ....
    <str name="update.chain">skip-empty</str>
  </lst>
</requestHandler>
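If only certain fields should be cleaned up, both processors should also accept field selectors in place of the all-fields default. A minimal sketch, assuming the field "y" from the question is the only one to strip; the chain name "skip-empty-y" is made up for illustration:

```xml
<updateRequestProcessorChain name="skip-empty-y">
  <!-- Hypothetical variant: restrict both processors to the field "y" -->
  <processor class="solr.TrimFieldUpdateProcessorFactory">
    <str name="fieldName">y</str>
  </processor>
  <processor class="solr.RemoveBlankFieldUpdateProcessorFactory">
    <str name="fieldName">y</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With this variant, the update.chain value in the handler definition would name "skip-empty-y" instead of "skip-empty".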
You can see the full list of UpdateRequestProcessors at http://solr-start.com