cassandra solr lucene datastax-enterprise

Getting weird results from a Solr query


I am using DataStax Enterprise 6.8. This is my Solr schema:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
  <types>
    <fieldType class="org.apache.solr.schema.StrField" name="StrField"/>
    <fieldType class="org.apache.solr.schema.TextField" name="NameField">
      <analyzer type="index">
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.NGramFilterFactory" maxGramSize="15" minGramSize="2"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.NGramFilterFactory" maxGramSize="15" minGramSize="2"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field indexed="true" multiValued="false" name="nama" type="StrField"/>
    <field indexed="true" multiValued="false" name="nama_copy" type="NameField"/>
  </fields>
  <uniqueKey>(nama)</uniqueKey>
  <copyField dest="nama_copy" source="nama"/>
</schema>

One of my rows has the field value batamindo v.

Then I ran this query:

http://my_ip_address:8983/solr/search.form/select?wt=json&indent=true&fl=nama&q=nama_copy:batamindo\ v

I got a very good result:

{
  "responseHeader":{
    "status":0,
    "QTime":8},
  "response":{"numFound":579,"start":0,"docs":[
      {
        "nama":"BATAMINDO V "},
      {
        "nama":"BATAMINDO V"},
      {
        "nama":"BATAMINDO V"},
      {
        "nama":"BATAMINDO V"},
      {
        "nama":"BATAMINDO V"},
      {
        "nama":"BATAMINDO V"},
      {
        "nama":"BATAMINDO V"},
      {
        "nama":"BATAMINDO V"},
      {
        "nama":"BATAMINDO V"},
      {
        "nama":"BATAMINDO V"}]
  }}

But when I ran

http://my_ip_address:8983/solr/search.form/select?wt=json&indent=true&fl=nama&q=nama_copy:batamindo\ vi

my search result was very bad:

{
  "responseHeader":{
    "status":0,
    "QTime":14},
  "response":{"numFound":602,"start":0,"docs":[
      {
        "nama":"MV. VINCA"},
      {
        "nama":"MV. VINASHIP PEARL"},
      {
        "nama":"MV. VINASHIP PEARL"},
      {
        "nama":"MV. VINCENT TRADER"},
      {
        "nama":"MV. MEGHNA VICTORY"},
      {
        "nama":"MV. MEGHNA VICTORY"},
      {
        "nama":"NAVI SUNNY"},
      {
        "nama":"MV. MEGHNA VICTORY"},
      {
        "nama":"MT. GOLDEN VIOLET"},
      {
        "nama":"MT. GOLDEN VIOLET"}]
  }}

What is happening here?


Solution

  • What you are seeing is expected behaviour.

    The NGramFilterFactory class splits each token into character grams of size N. In your case, every token is broken up into grams of 2 to 15 characters, based on this line in your schema definition:

            <filter class="solr.NGramFilterFactory" maxGramSize="15" minGramSize="2"/>
    

    For an input string like cassandra, the N-gram filter generates the following grams:

    • size=2 : ca as ss sa an nd dr ra
    • size=3 : cas ass ssa san and ndr dra
    • size=4 : cass assa ssan sand andr ndra
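    • and so on up to size=9, the full token cassandra (maxGramSize="15" is only an upper bound, and a gram cannot be longer than the token itself)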
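
The gram list above can be reproduced with a few lines of Python. This is only an illustrative sketch of character n-gram generation, not Lucene's implementation; the helper name ngrams is made up for this example, and its defaults simply mirror minGramSize="2" and maxGramSize="15" from the schema.

```python
def ngrams(token, min_gram=2, max_gram=15):
    """Return the character n-grams of `token`, grouped by gram size.

    Illustrative helper only (hypothetical name); it mimics the effect of
    an n-gram filter on a single token, not Lucene's actual code.
    """
    grams_by_size = {}
    # A gram can never be longer than the token, so stop at len(token).
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        grams_by_size[size] = [
            token[start:start + size]
            for start in range(len(token) - size + 1)
        ]
    return grams_by_size


if __name__ == "__main__":
    for size, grams in ngrams("cassandra").items():
        print(f"size={size} : {' '.join(grams)}")
```

Note that a token shorter than min_gram, such as a single letter, yields no grams at all in this sketch.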

    For the search term ss, the Solr query matches every indexed value that produced the gram ss, that is, every value that contains ss anywhere inside a word (cassandra, for example).

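    Your query analyzer runs the same steps: LowerCaseTokenizerFactory splits batamindo\ vi at the space into batamindo and vi, and each token is then broken into grams. Any document that contains at least one of those grams is a match, so the term vi is expected to match vinca, vinaship, vincent, victory, navi, violet and so on.

    The first query behaves differently most likely because the single letter v is shorter than minGramSize="2" and so produces no gram at all, which leaves only the grams of batamindo in the query.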
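
To see why the two queries in the question differ, here is a rough Python simulation of the NameField analysis chain. It is a sketch under stated assumptions, not what Solr executes: the function names tokenize and grams are made up, ASCII folding and the stop-word filter are left out (assuming none of these words are in stopwords.txt), and it assumes that a token shorter than minGramSize produces no gram.

```python
import re


def tokenize(text):
    """Approximate LowerCaseTokenizerFactory: lowercase, split at non-letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def grams(text, min_gram=2, max_gram=15):
    """Collect the set of character n-grams of every token in `text`.

    Assumption: tokens shorter than `min_gram` contribute nothing.
    """
    result = set()
    for token in tokenize(text):
        for size in range(min_gram, min(max_gram, len(token)) + 1):
            for start in range(len(token) - size + 1):
                result.add(token[start:start + size])
    return result


if __name__ == "__main__":
    # Grams that only the second query adds on top of the first one.
    extra = grams("batamindo vi") - grams("batamindo v")
    print("extra query grams:", extra)

    # A few values taken from the two result lists in the question.
    for name in ["BATAMINDO V", "MV. VINCA", "NAVI SUNNY", "MT. GOLDEN VIOLET"]:
        print(name, "-> matched by the extra grams:", bool(extra & grams(name)))
```

Under these assumptions the only gram the second query adds is vi, and that one gram is enough to pull in MV. VINCA, NAVI SUNNY and MT. GOLDEN VIOLET, none of which contain batamindo.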

    For more information, see Document Analysis in Solr. Cheers!