How to remove duplicates from Multivalued Fields in SOLR?


I tried the solutions listed in the question below, without success:

Removing Solr duplicate values into multivalued field

I'm using the DataImportHandler and creating multiple values for the field with RegexTransformer.

My SQL query returns this for column FOO:

Johnny Cash, Bonnie Money, Honey Bunny, Johnny Cash

and I store it in the multivalued field foo using splitBy=",":

<field column="FOO" name="foo" splitBy=","/>    

and it's stored in the multivalued field as such

{"Johnny Cash", "Bonnie Money", "Honey Bunny", "Johnny Cash"}

I've added this to solrconfig.xml:

  <updateRequestProcessorChain name="distinctMultiValued" default="true">
    <!-- To remove duplicate values in a multivalued field-->
    <processor class="DistributedUpdateProcessorFactory"/>
    <processor class="UniqFieldsUpdateProcessorFactory">
        <str name="fieldRegex">foo</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />        
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

I've also tried fieldName in place of fieldRegex and tried *oo instead of foo, but the duplicates remain.

Does this have something to do with the RegexTransformer?

I also have an update chain with TrimFieldUpdateProcessorFactory that runs without any issues.


Solution

  • I was able to resolve this by moving the UniqFieldsUpdateProcessorFactory to the existing <updateRequestProcessorChain> block I had.

      <updateRequestProcessorChain name="skip-empty" default="true">
        <!--  Next two processors affect all fields - default configuration -->
        <processor class="TrimFieldUpdateProcessorFactory" />
        <processor class="RemoveBlankFieldUpdateProcessorFactory" />
        <processor class="UniqFieldsUpdateProcessorFactory">
            <str name="fieldRegex">.*oo</str>
        </processor>
        <processor class="solr.LogUpdateProcessorFactory" />
        <processor class="solr.RunUpdateProcessorFactory" />
      </updateRequestProcessorChain>
    

    From the Solr documentation for UpdateRequestProcessorChain:

    At most one processor chain may be configured as the "default". If no processor is configured as a default, then an implicit default using LogUpdateProcessorFactory and RunUpdateProcessorFactory is created for you. Supplying a default processor chain may be the only way to affect documents indexed from some sources like the dataimport handler.
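
    The order of processors in the working chain may matter as well. Splitting the sample value on a bare "," leaves a leading space on every value after the first, so " Johnny Cash" and "Johnny Cash" are not exact duplicates until they are trimmed. The sketch below is plain Python, not Solr code. The helper names split_by, trim and uniq are hypothetical stand-ins for the splitBy attribute, the trim processor and the uniq processor, and it assumes the split keeps surrounding whitespace and that duplicates are matched by exact value.

```python
# Illustration only: mimics the split / trim / uniq steps outside Solr.
# split_by, trim and uniq are hypothetical helper names, not Solr APIs.

def split_by(raw, sep=","):
    """Split on the separator and keep any surrounding whitespace."""
    return raw.split(sep)


def trim(values):
    """Strip leading and trailing whitespace from every value."""
    return [v.strip() for v in values]


def uniq(values):
    """Drop exact duplicates, keeping first occurrences in original order."""
    return list(dict.fromkeys(values))


raw = "Johnny Cash, Bonnie Money, Honey Bunny, Johnny Cash"
values = split_by(raw)

# Without trimming, the last value still carries a leading space,
# so it does not match the first value exactly.
uniq_only = uniq(values)

# Trimming first makes the two "Johnny Cash" values identical.
trim_then_uniq = uniq(trim(values))

print(values)
print(uniq_only)
print(trim_then_uniq)
```

    If this is what happens in your import, keeping the trim processor ahead of the uniq processor, as in the chain above, is the safer ordering.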