Solr: phrase query returns results in some cases but not in others


I get Solr results for the following:

  • Sports
  • World Health Organisation
  • percent

but I don't get results for these:

  • Sport (UK)
  • World Health Organisat
  • 1-percent

All of these are in the text field, which definitely contains these phrases, and I have used an n-gram filter in the index analyzer, so the combinations do exist. While the Analysis tab of the Solr UI shows me exactly what I am expecting, I am not getting the required results in my Java output.

My SolrJ code is below:

query.setQuery("full_text:\"World Health Organisation\"");

Also, I have to add the escaped quotes (\"...\") because I always get errors in my front end if I remove them, and half of the results I otherwise get no longer turn up.
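On the quoting point: with the standard query parser, an unquoted `full_text:World Health Organisation` applies the field prefix only to the first word, and the remaining words go to the default field. That may explain the errors and missing results, although this is an assumption because the actual error message is not shown. Below is a minimal sketch of a helper that always builds a quoted phrase query. `PhraseQueries` and `phrase` are hypothetical names, not part of SolrJ; SolrJ itself ships `ClientUtils.escapeQueryChars` for escaping unquoted terms.

```java
final class PhraseQueries {
    private PhraseQueries() {}

    // Hypothetical helper: returns field:"phrase".
    // Inside a quoted phrase only backslashes and double quotes need escaping,
    // so those two characters are escaped before the phrase is wrapped in quotes.
    static String phrase(String field, String userInput) {
        String escaped = userInput
                .replace("\\", "\\\\")   // escape backslashes first
                .replace("\"", "\\\"");  // then escape embedded quotes
        return field + ":\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        System.out.println(phrase("full_text", "World Health Organisation"));
        System.out.println(phrase("full_text", "Sport (UK)"));
    }
}
```

The result can be passed straight to the existing call, for example `query.setQuery(PhraseQueries.phrase("full_text", "World Health Organisation"));`.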

Can someone help with what I might be missing?

Much thanks!

Edit: definition of full_text in schema.xml

<field name="full_text" type="text_en" indexed="true" stored="false" multiValued="true"/>   
   <copyField source="title" dest="full_text"/>
   <copyField source="content" dest="full_text"/>

   <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="20"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

Solution: I figured out what the problem was. For the "Sports (UK)" and "1-percent" cases, the tokeniser I was using removed all special characters, so I changed my tokeniser. The "World Health Organisation" case was caused by the stemmer, which reduced "Organisation" to "Organis" at index time, while a query term like "Organisat" was left as it is. The n-grams of "organis" can never contain the longer "organisat", hence I did not get results. Since I am already using an n-gram filter, I removed the stemmer.
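The stemmer mismatch can be reproduced outside Solr with a small sketch. It assumes the filter emits every substring between minGramSize and maxGramSize, and it takes the stemmed form "organis" from the description above rather than running a real Porter stemmer. `NGramDemo` and `ngrams` are hypothetical names for illustration only.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class NGramDemo {
    private NGramDemo() {}

    // Collects every substring of term whose length is between min and max (inclusive).
    static Set<String> ngrams(String term, int min, int max) {
        Set<String> grams = new LinkedHashSet<>();
        for (int start = 0; start < term.length(); start++) {
            for (int len = min; len <= max && start + len <= term.length(); len++) {
                grams.add(term.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Index side with stemming: grams are built from the stem "organis".
        Set<String> stemmed = ngrams("organis", 3, 20);
        // Index side without stemming: grams are built from the full word.
        Set<String> unstemmed = ngrams("organisation", 3, 20);

        System.out.println(stemmed.contains("organisat"));   // false
        System.out.println(unstemmed.contains("organisat")); // true
    }
}
```

With the stemmer in place, the query term "organisat" has nothing to match in the index; without it, the same term is one of the indexed grams.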

Hope this helps others in the long run. :)

