Tags: django, solr, django-haystack, django-oscar

After upgrading Solr 4.10 to 6.3 the search stopped working


I was given the task of upgrading Solr, however I have never worked with Solr before. My current stack is: Django 1.9.12 + Oscar 1.3 + Solr 6.3.0 + Haystack 2.5.1

I generated a schema with Haystack, put it in the managed-schema file and modified it a bit according to StackOverflow answers, because Solr would not start. Now Solr starts, but the site's search field cannot find anything (however, with Solr 4.10 the search worked as expected without any problems).

In solrconfig.xml in the section below:

<requestHandler name="/select" class="solr.SearchHandler">
<!-- default values for query parameters can be specified, these
     will be overridden by parameters in the request
  -->
<lst name="defaults">
  <str name="echoParams">explicit</str>
  <int name="rows">10</int>
</lst>
</requestHandler>

I tried to add:

<str name="df">text</str>
<str name="q.op">AND</str>

After that the search partially started to work.

A few examples:

  1. There is an item INTEL Pentium G3260 (CM8064601482506); the search works only with INTEL Pentium or CM8064601482506. Searching for INTEL Pentium G3260, Pentium G3260, INTEL G3260 or G3260 gives no results.

  2. Search string: AMD a8-6500; Result: Nothing to show (no results) -> should find AMD a8-6500

  3. Search string: AMD; Result: Shows all AMD products -> as expected

If I change <str name="q.op">AND</str> to <str name="q.op">OR</str>:

  1. Search string: AMD a8-6500; Result: shows all AMD products and A8-6500 -> should find just AMD a8-6500

  2. Search string: a8-6500; Result: AMD A8-6500 (AD650BOKA44HL) and INTEL Core™ i5 6500 -> should find just AMD a8-6500
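The difference between the two operators can be illustrated with a small toy model in Python. This is only a sketch: the `tokenize` and `search` helpers are hypothetical, and the tokenizer (lowercase, split on anything that is not a letter or digit) is a simplification, not Solr's actual analysis chain. It assumes the hyphen in a8-6500 splits the term into a8 and 6500.

```python
import re

# Sample product names taken from the examples above.
PRODUCTS = [
    "AMD A8-6500 (AD650BOKA44HL)",
    "INTEL Core™ i5 6500",
    "INTEL Pentium G3260 (CM8064601482506)",
]


def tokenize(text):
    """Simplified stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def search(query, documents, operator="AND"):
    """Return documents containing all (AND) or any (OR) of the query terms."""
    terms = set(tokenize(query))
    combine = all if operator == "AND" else any
    results = []
    for doc in documents:
        doc_tokens = set(tokenize(doc))
        if combine(term in doc_tokens for term in terms):
            results.append(doc)
    return results


if __name__ == "__main__":
    # OR matches every product that shares at least one token with the query.
    print(search("a8-6500", PRODUCTS, operator="OR"))
    # AND keeps only products that contain every token of the query.
    print(search("a8-6500", PRODUCTS, operator="AND"))
```

Under these assumptions, OR returns both the AMD A8-6500 and the Core i5 6500 for a8-6500 because they share the token 6500, while AND narrows the result to the AMD A8-6500 only.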

My current solrconfig.xml and managed-schema are on GitHub.

Currently I use EdgeNgramField as the index field, i.e.:

from haystack import indexes

class ProductIndexes(indexes.SearchIndex, indexes.Indexable):
    text = indexes.EdgeNgramField(
            document=True, use_template=True,
            template_name='search/indexes/cpu/item_text.txt')

How can I fix/normalize the search?


Update 1: Warnings at the Dashboard's logging page

[default] default search field in schema is text. WARNING: Deprecated, please use 'df' on request instead.
[default] query parser default operator is AND. WARNING: Deprecated, please use 'q.op' on request instead.

These can be fixed by removing

  <defaultSearchField>text</defaultSearchField>
  <solrQueryParser defaultOperator="AND"/>

from the managed-schema file.

Update 2: Based on Socratees's answer, here are the final changes:

  1. indexes.EdgeNgramField in the following code:

    class ProductIndexes(indexes.SearchIndex, indexes.Indexable):
        text = indexes.EdgeNgramField(
            document=True, use_template=True,
            template_name='search/indexes/cpu/item_text.txt')

    is changed to indexes.CharField.

  2. My other fields already use indexes.CharField, and in managed-schema I found that these fields use the type text_en, so I replaced fieldType name="text_en" from:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
        />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
      -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
      -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

which is generated by Haystack, to:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StandardFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StandardFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

  3. In solrconfig.xml the code:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
  </lst>
</requestHandler>

changed to:

<requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="df">text</str>
      <str name="q.op">AND</str>
    </lst>
</requestHandler>


Solution

  • If I want to find INTEL Pentium G3260 or Pentium G3260 or INTEL G3260 or G3260 - no results.

    This is related to how a field is analyzed and tokenized. Refer to the documentation here.

    Tokenization using ClassicTokenizerFactory will behave like this: input: "Please, email john.doe@foo.com by 03-09, re: m37-xq." output: "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"

    Tokenization using solr.EdgeNGramTokenizerFactory (with minGramSize="2" and maxGramSize="5"; both default to 1) will behave like this: input: "babaloo" output: "ba", "bab", "baba", "babal"

    In schema.xml, you can define a new fieldtype, or update the existing one like so:

    <fieldType name="text" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
      </analyzer>
    </fieldType>
    

    Play around and see which one fits your scenario. You might also want to look at how the query you give is normalized. But this is a good starting point.
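To get a feel for the edge n-gram output quoted above, the Python sketch below reproduces it. It is only an illustration: `edge_ngrams` is a hypothetical helper, not part of Solr or Haystack, and it assumes a minimum gram size of 2 and a maximum of 5, which is what the "babaloo" output corresponds to.

```python
def edge_ngrams(token, min_gram=2, max_gram=5):
    """Return the front-anchored n-grams of `token`.

    Assumption: grams start at the first character and are between
    min_gram and max_gram characters long, as in the "babaloo" example.
    """
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]


if __name__ == "__main__":
    print(edge_ngrams("babaloo"))       # grams of length 2 to 5
    print(edge_ngrams("g3260", 2, 15))  # a short token is kept whole as its last gram
```

Note that with these sizes the full word "babaloo" is not among the generated grams, because it is longer than the maximum gram size. Whether a query term matches therefore depends on how the query side is analyzed compared with the index side.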