Tags: python, django, elasticsearch, django-haystack

Enabling Synonyms in Django Haystack with Elasticsearch Backend


I am having trouble getting a synonym filter to work in Haystack with a custom Elasticsearch backend.

At this point I am only trying to create a single synonym for testing purposes: I would like to pair the word 'tricklenutz' with the word 'lipstick'.

I am using the following custom haystack backend:

from django.conf import settings
from haystack.backends.elasticsearch_backend import (ElasticsearchSearchBackend,
    ElasticsearchSearchEngine)

class SiteElasticBackend(ElasticsearchSearchBackend):

    def __init__(self, connection_alias, **connection_options):
        super(SiteElasticBackend, self).__init__(
            connection_alias, **connection_options)
        MY_SETTINGS = {
            'settings': {
                "analysis": {
                    "analyzer": {
                        "synonym_analyzer": {
                            "type": "custom",
                            "tokenizer": "lowercase",
                            "filter": ["synonym"]
                        },
                        "ngram_analyzer": {
                            "type": "custom",
                            "tokenizer": "lowercase",
                            "filter": ["haystack_ngram", "synonym"]
                        },
                        "edgengram_analyzer": {
                            "type": "custom",
                            "tokenizer": "lowercase",
                            "filter": ["haystack_edgengram", "synonym"]
                        }
                    },
                    "tokenizer": {
                        "haystack_ngram_tokenizer": {
                            "type": "nGram",
                            "min_gram": 3,
                            "max_gram": 15,
                        },
                        "haystack_edgengram_tokenizer": {
                            "type": "edgeNGram",
                            "min_gram": 2,
                            "max_gram": 15,
                            "side": "front"
                        }
                    },
                    "filter": {
                        "synonym": {
                            "type": "synonym",
                            "synonyms": [
                                "tricklenutz, lipstick"
                            ]
                        },
                        "haystack_ngram": {
                            "type": "nGram",
                            "min_gram": 3,
                            "max_gram": 15
                        },
                        "haystack_edgengram": {
                            "type": "edgeNGram",
                            "min_gram": 5,
                            "max_gram": 15
                        }
                    }
                }
            }
        }
        # Override the settings Haystack uses when it creates the index.
        self.DEFAULT_SETTINGS = MY_SETTINGS


class ConfigurableElasticSearchEngine(ElasticsearchSearchEngine):
    backend = SiteElasticBackend

As you can see, I am just trying to create a synonym for 'lipstick' to 'tricklenutz' (a word that does not show up in any searches).

I have the following entry in my settings.py file:

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'search.backends.site_elasticsearch_backend.ConfigurableElasticSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'sitename'
    },
}

Here is my search_index.py for a Brand:

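import datetime

from haystack import indexes

# Brand is the project's model; its import is not shown in the original post.


class BrandIndex(indexes.SearchIndex, indexes.Indexable):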
    text = indexes.CharField(document=True, use_template=True)
    ngram_text = indexes.EdgeNgramField()
    name = indexes.NgramField(model_attr='name')
    brand_name = indexes.CharField(model_attr='name')
    created_date = indexes.DateTimeField(model_attr='created_date')

    def get_model(self):
        return Brand

    def prepare(self, obj):
        """Add the content of the text field from the final prepared data to the ngram_text field."""
        prepared_data = super(BrandIndex, self).prepare(obj)
        prepared_data['ngram_text'] = prepared_data['text']
        return prepared_data

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        return Brand.objects.filter(created_date__lte=datetime.datetime.now())

And here is the view portion for the search:

class BrandListSearchResults(ListSearchResultsViewMixin, BrandListBase):
    template_name = 'search/brand/search.html'
    page_template = 'search/brand/page.html'
    paginate_by = 50
    paginate_by_first = 50

    def get_queryset(self):
        return self.get_sqs().filter(text=self.search_term)

    def get_context_data(self, **kwargs):
        data = super(BrandListSearchResults, self).get_context_data(**kwargs)
        meta = Meta(
            title='All brands matching the search term %s' % self.search_term,
            description='Brand search results for %s' % self.search_term
        )
        data['meta'] = meta
        data['paginate_by'] = self.paginate_by
        data['paginate_by_first'] = self.paginate_by_first
        data['size_list'] = ["90","110","185"]
        return data

I have re-run my indexing but the synonym does not appear to be working.

Is there a way to query Elasticsearch to see whether the synonyms actually exist? The Haystack management command is not very verbose about what it does with custom filters, etc.

Update

I have been able to query my settings directly from Elasticsearch, and I can see that the synonyms are there:

curl -XGET 'http://localhost:9200/sitename/_settings?pretty'
{
  "sitename" : {
    "settings" : {
      "index" : {
        "creation_date" : "1427470212556",
        "uuid" : "6eznekoORQKqwswTq1G24w",
        "analysis" : {
          "analyzer" : {
            "synonym_analyzer" : {
              "type" : "custom",
              "filter" : [ "synonym" ],
              "tokenizer" : "lowercase"
            },
            "ngram_analyzer" : {
              "type" : "custom",
              "filter" : [ "haystack_ngram", "synonym" ],
              "tokenizer" : "lowercase"
            },
            "edgengram_analyzer" : {
              "type" : "custom",
              "filter" : [ "haystack_edgengram", "synonym" ],
              "tokenizer" : "lowercase"
            }
          },
          "filter" : {
            "haystack_ngram" : {
              "type" : "nGram",
              "min_gram" : "3",
              "max_gram" : "15"
            },
            "haystack_edgengram" : {
              "type" : "edgeNGram",
              "min_gram" : "5",
              "max_gram" : "15"
            },
            "synonym" : {
              "type" : "synonym",
              "synonyms" : [ "tricklenutz, lipstick" ]
            }
          },
          "tokenizer" : {
            "haystack_edgengram_tokenizer" : {
              "max_gram" : "15",
              "min_gram" : "2",
              "type" : "edgeNGram",
              "side" : "front"
            },
            "haystack_ngram_tokenizer" : {
              "type" : "nGram",
              "min_gram" : "3",
              "max_gram" : "15"
            }
          }
        },
        "number_of_replicas" : "1",
        "number_of_shards" : "5",
        "version" : {
          "created" : "1040399"
        }
      }
    }
  }
}
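
The settings output confirms only that the filter is defined in the index, not that any analyzer is emitting the synonym at index or query time. One way to check what an analyzer produces is the `_analyze` endpoint. The sketch below uses only the Python standard library; it assumes the Elasticsearch 1.x form of the API, where the analyzer name and the text are passed as query parameters (later versions expect a JSON body instead), and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def analyze_url(base_url, index, analyzer, text):
    """Build an _analyze URL (Elasticsearch 1.x query-parameter form, assumed)."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return "%s/%s/_analyze?%s" % (base_url.rstrip("/"), index, query)


def token_texts(response):
    """Return the token strings from a parsed _analyze response."""
    return [tok["token"] for tok in response.get("tokens", [])]


def fetch_tokens(base_url, index, analyzer, text):
    """Call the endpoint and return the emitted tokens (needs a running node)."""
    with urlopen(analyze_url(base_url, index, analyzer, text)) as resp:
        return token_texts(json.loads(resp.read().decode("utf-8")))
```

Calling `fetch_tokens('http://127.0.0.1:9200', 'sitename', 'synonym_analyzer', 'tricklenutz')` against the index above would be expected to list both terms if that analyzer applies the synonym filter; running the same check with the analyzer a field actually uses shows whether searches on that field can benefit from it.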

Solution

  • The first thing I notice is that your synonym_analyzer is configured but never used: defining an analyzer in the index settings does not attach it to any field. You need to either make it the default analyzer or assign it on a field-by-field basis, which requires additional changes in your custom backend as well as extended field classes (here's an example).

    I've run into similar frustration trying to understand how documents actually get from Django to Elasticsearch. You can combine direct calls to Elasticsearch's HTTP API with some additional introspection via Haystack. The linked elasticstack package includes a command I wrote called show_mapping, which shows the JSON used to create your mapping. This way you can at least see whether your fields are configured to use the analyzers you've set up.

    Short disclaimer: I have not kept up with the most recent changes in Haystack (after 2.0 or 2.1), so some of these suggestions may themselves need updating.
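
The field-by-field approach comes down to adding an `analyzer` key to the mapping entry of each text field that should use it, and then confirming which fields ended up with it. The sketch below shows both steps on plain dictionaries. The function names are hypothetical, and it assumes mapping entries shaped like `{'type': 'string', ...}` as in Elasticsearch 1.x. In a custom backend, something like `apply_analyzer` could be called on the mapping returned by an overridden `build_schema`, though that hook should be checked against the installed Haystack version.

```python
def apply_analyzer(mapping, analyzer,
                   skip_analyzers=("ngram_analyzer", "edgengram_analyzer")):
    """Return a copy of `mapping` with `analyzer` set on analyzed string fields.

    Fields that are not strings, are marked not_analyzed, or already use one
    of `skip_analyzers` (the n-gram fields here) are left unchanged.
    """
    updated = {}
    for name, field in mapping.items():
        field = dict(field)  # copy so the caller's mapping is not mutated
        is_text = (field.get("type") == "string"
                   and field.get("index") != "not_analyzed")
        if is_text and field.get("analyzer") not in skip_analyzers:
            field["analyzer"] = analyzer
        updated[name] = field
    return updated


def fields_using(mapping, analyzer):
    """List the field names whose mapping entry names `analyzer`."""
    return sorted(name for name, field in mapping.items()
                  if field.get("analyzer") == analyzer)


# Illustrative mapping only; real output from the index will have more keys.
mapping = {
    "text": {"type": "string", "analyzer": "snowball"},
    "name": {"type": "string", "analyzer": "ngram_analyzer"},
    "created_date": {"type": "date"},
}
updated = apply_analyzer(mapping, "synonym_analyzer")
print(fields_using(updated, "synonym_analyzer"))  # prints ['text']
```

Leaving the n-gram fields alone is a design choice made for this illustration: running a synonym filter over n-gram fragments rarely matches a whole-word synonym, so the sketch only switches the plain text fields.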