Trimming fields when adding documents to Solr


I'm using Solr's DataImportHandler to index data from a database. The table schema uses CHAR columns, which are fixed-width, so the values are padded with trailing spaces.

I'm trying to trim these trailing spaces with solr.TrimFilterFactory. In my Solr schema.xml I'm using the following field type to index the data:

<fieldType name="string" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.TrimFilterFactory" updateOffsets="true" />
    </analyzer>
</fieldType>

Then I add a document like this:

<add>
    <doc>
        <field name="test">Test       </field>
    </doc>
</add>

I expect the trailing spaces in the test field to be removed, but when I query for test:Test*, I get:

<?xml version="1.0" encoding="UTF-8"?>
<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
    </lst>
    <result name="response" numFound="1" start="0">
        <doc>
            <str name="test">Test       </str>
        </doc>
    </result>
</response>

As you can see, the trailing spaces are not removed. I must be doing something wrong, or I have misunderstood the concept of filters. I expected the query to return:

<?xml version="1.0" encoding="UTF-8"?>
<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
    </lst>
    <result name="response" numFound="1" start="0">
        <doc>
            <str name="test">Test</str>
        </doc>
    </result>
</response>

So my question is: how can I make sure that all trailing spaces are removed when these documents are indexed?


Solution

  • Solr analyzers and filters do not modify the stored value; they only affect the indexed terms.
    The TrimFilterFactory therefore trims the term that goes into the index, but the stored value, which is what a query returns, stays exactly as it was sent in.

    If you are using DIH, look at the ScriptTransformer to modify the value before it is fed to Solr.
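
As a rough illustration of the ScriptTransformer approach, a DIH data-config.xml could look something like the sketch below. The connection settings, the table name my_table, the entity name item and the function name trimTest are all made up for the example; only the test column comes from the question. It also assumes the column arrives as a string under the lowercase name test, which may differ depending on the JDBC driver.

```xml
<dataConfig>
    <!-- Hypothetical connection settings; replace with your own. -->
    <dataSource driver="org.example.Driver"
                url="jdbc:example://localhost/mydb"
                user="solr"
                password="secret" />

    <script><![CDATA[
        // Hypothetical helper: DIH calls it once per row.
        // "row" is a java.util.Map of column name to value.
        function trimTest(row) {
            var value = row.get('test');
            if (value != null) {
                // trim() removes leading as well as trailing whitespace.
                row.put('test', value.trim());
            }
            return row;
        }
    ]]></script>

    <document>
        <!-- my_table is an assumed table name. -->
        <entity name="item"
                query="SELECT test FROM my_table"
                transformer="script:trimTest">
            <field column="test" name="test" />
        </entity>
    </document>
</dataConfig>
```

Because the transformer runs before the document reaches Solr, both the stored and the indexed value should end up trimmed, so the TrimFilterFactory in the field type would no longer be needed for this purpose.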