Tags: solr, lucene, solr6

How to search a field that may contain spaces, hyphens, and a concatenated number?


Hi, I have a field with the following schema:

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" protected="protwords.txt" splitOnCaseChange="1" generateWordParts="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" protected="protwords.txt" splitOnCaseChange="1" generateWordParts="0" preserveOriginal="1" catenateAll="0" catenateWords="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

I am storing complete PDF documents.

Now suppose I have 4 documents with the following content.

1. stackoverflow is a good site.
2. stack-overflow is a good site.
3. stack overflow is a good site.
4. stackoverflow2018 is a good site. 

When I search for stackoverflow, it should return document 1; when I search for stack-overflow, it should return document 2; when I search for stack overflow, it should return document 3; and when I search for stackoverflow2018, it should return document 4.

What should the schema look like for this? The schema above does not work in this case. Is there anything I could specify in the query?


Solution

  • A Word Delimiter Graph Filter will split on non-alphanumerics (-), case changes, and numbers by default.

    Delimiters are determined by the following rules:

    A change in case within a word: "CamelCase" -> "Camel", "Case". This can be disabled by setting splitOnCaseChange="0".

    A transition from alpha to numeric characters or vice versa: "Gonzo5000" -> "Gonzo", "5000"; "4500XL" -> "4500", "XL". This can be disabled by setting splitOnNumerics="0".

    Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"

    A trailing "'s" is removed: "O’Reilly’s" -> "O", "Reilly"

    Any leading or trailing delimiters are discarded: "--hot-spot--" -> "hot", "spot"
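
    The two optional rules above each name an attribute that turns them off. As a rough sketch (not a tested configuration), the filter from the question could be kept with only those two rules disabled. It uses the WordDelimiterFilterFactory class from the question's schema, and the attribute values are illustrative assumptions:

    ```xml
    <!-- Sketch only: keeps the word delimiter filter but disables two split rules. -->
    <!-- splitOnCaseChange="0": "CamelCase" is no longer split into "Camel", "Case". -->
    <!-- splitOnNumerics="0": "stackoverflow2018" is no longer split at the letter/digit boundary. -->
    <!-- Non-alphanumeric characters such as "-" still act as delimiters. -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            preserveOriginal="1"/>
    ```

    A change like this would normally be made in both the index and the query analyzer, and existing documents would need to be re-indexed before it affects search results.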

    If you don't want that behavior, remove the WordDelimiterFilter from your filter list and add other filters to support the part of the WDF behavior that you need.
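
    A minimal sketch of what removing the filter could look like, assuming exact token matching is all that is needed. The field type name "text_exact_ws" is hypothetical, and the stop word file is carried over from the question's schema:

    ```xml
    <!-- Hypothetical field type; the name "text_exact_ws" is not from the original schema. -->
    <fieldType name="text_exact_ws" class="solr.TextField" positionIncrementGap="100">
      <!-- A single analyzer with no type attribute is used for both indexing and querying. -->
      <analyzer>
        <!-- Splits on whitespace only, so "stack-overflow" and "stackoverflow2018" stay whole. -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    ```

    With this analysis chain, document 2 is indexed as the single token "stack-overflow" and document 3 as the two tokens "stack" and "overflow", so each of the four example queries should match only its own document. One trade-off is that punctuation stays attached to words ("site." keeps its period), so this is a starting point rather than a complete schema. Documents must be re-indexed after the field type changes.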