solr lucene search-engine sql-execution-plan

Solr: manipulate query string


How can I manipulate query strings that are sent to Solr?

For example, someone enters "stackoverflow-version1.0" and gets no results, whereas a query for just "stackoverflow" would have succeeded. So I want to truncate the query at "-" and search again using only the first part.

Some research led me to the solr.PatternReplaceCharFilterFactory class, which I included in my schema.xml as shown below. Can anyone see why my query still does not return any results? Are there other classes I should use?

UPDATE: My code now looks as follows:

<fieldType name="ngram" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="20" />
    <filter class="solr.WordDelimiterFilterFactory"
      generateWordParts="1"
      splitOnNumerics="0"
      generateNumberParts="0"
      catenateWords="0"
      catenateNumbers="0"
      catenateAll="0"
      preserveOriginal="1"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
      generateWordParts="1"
      generateNumberParts="0"
      splitOnNumerics="0"
      catenateWords="0"
      catenateNumbers="0"
      catenateAll="0"
      preserveOriginal="1"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

Running the analyzer gives me this: [screenshot not available]

And here is the query UI: [screenshot not available]


Solution

  • Try the WordDelimiterFilterFactory; it has many options you can experiment with.

    For example, you can use the following field type for your field:

    <fieldType name="subword" class="solr.TextField">
      <!-- Query time: split tokens on delimiters such as "-" and also keep the original token -->
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="0"
          catenateNumbers="0"
          catenateAll="0"
          preserveOriginal="1"
        />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
      </analyzer>
      <!-- Index time: same splitting, plus catenated word parts and number parts -->
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="1"
          catenateNumbers="1"
          catenateAll="0"
          preserveOriginal="1"
        />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
      </analyzer>
    </fieldType>
    

    You can experiment with the WordDelimiterFilterFactory options here.

    Once the field type has been added and applied to the field, restart the server. You can then inspect the input and output on the Solr analysis page, which shows how tokens are generated for a given input at index time and at query time.
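
    As a minimal sketch of what "applied to the field" means, a field declaration in schema.xml can reference the field type by name. The field name "title" below is an assumption made for illustration. Documents that are already indexed have to be reindexed before a change to the index-time analyzer takes effect.

    ```xml
    <!-- "title" is a hypothetical field name; "subword" is the field type defined above -->
    <field name="title" type="subword" indexed="true" stored="true"/>
    ```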

    This will help you build a custom field type that meets your requirements.

    Here is a link that lists all the tokenizers and filters with examples: analyzers
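
    If you would rather truncate the query at "-" as described in the question, one possible sketch is to put solr.PatternReplaceCharFilterFactory in the query analyzer only. The field type name "subword_truncated" and the regular expression are assumptions made for illustration, not a tested configuration. Char filters must be declared before the tokenizer, because they operate on the raw text before it is split into tokens.

    ```xml
    <!-- "subword_truncated" is a hypothetical field type name -->
    <fieldType name="subword_truncated" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <!-- Runs before the tokenizer: removes each "-" together with the
             non-whitespace characters that follow it -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
          pattern="-\S*"
          replacement=""
        />
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    ```

    With this sketch the query "stackoverflow-version1.0" should be analyzed as "stackoverflow". The trade-off is that the suffix is always discarded, so a document that really contains "stackoverflow-version1.0" can no longer be told apart from one that contains only "stackoverflow". Check the result on the analysis page before relying on it.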