Tags: python, django, elasticsearch, django-haystack

django-haystack elasticsearch multiple indexes wrong results


I have set up a search index on my site using django-haystack + elasticsearch. There are several indexes and in general they work correctly, except for ProjectIndex when I search for a Person. Let me explain:

These are the models:

from django.db import models

class Person(models.Model):
    first_name = models.CharField()
    last_name = models.CharField()

class Project(models.Model):
    project_name = models.CharField()
    employees = models.ManyToManyField(Person)

And these are the indexes:

from haystack import indexes

class ProjectIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.EdgeNgramField(document=True, use_template=True)
    project_name = indexes.CharField(model_attr='project_name', boost=1.4)
    employees = indexes.CharField(boost=1.5)

    def get_model(self):
        return Project

    def prepare_employees(self, obj):
        return ' '.join([employee.__unicode__() for employee in obj.employees.all()])

    def prepare(self, obj):
        data = super(ProjectIndex, self).prepare(obj)
        data['boost'] = 1.3
        return data

class PersonIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.EdgeNgramField(document=True, use_template=True)
    first_name = indexes.CharField(model_attr='first_name', boost=1.1)
    last_name = indexes.CharField(model_attr='last_name', boost=1.2)

    def get_model(self):
        return Person

When I run rebuild_index, all projects seem to be indexed correctly. A plain HTTP query to the Elasticsearch server for a person's name returns meaningful project results:

>>> from urllib import urlopen
>>> response = urlopen('http://127.0.0.1:9200/_search?q=' + q)
>>> json_data = response.read()
>>> from json import loads
>>> d = loads(json_data)
>>> f = filter(lambda d: d['_source']['django_ct'] == "project.project", d['hits']['hits'])
>>> len(f)
8
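
For anyone reproducing this check on Python 3, here is a rough equivalent of the session above. It is only a sketch: the helper names `fetch_hits` and `count_by_model` are made up for illustration, and it assumes the same local Elasticsearch URL as in the question. Keep in mind that Elasticsearch returns only the first 10 hits unless a `size` parameter is passed, so the count covers the top of the result list, not every match.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_hits(query, base_url="http://127.0.0.1:9200/_search"):
    """Run a URI search against Elasticsearch and return the list of hits."""
    url = base_url + "?" + urlencode({"q": query})
    with urlopen(url) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["hits"]["hits"]


def count_by_model(hits):
    """Count hits per Haystack content type (the `django_ct` field)."""
    return Counter(hit["_source"]["django_ct"] for hit in hits)


# Usage (needs a running Elasticsearch server):
#   counts = count_by_model(fetch_hits(q))
#   counts["project.project"]
```

Counting per `django_ct` makes it easier to see how many of the returned hits belong to each model before comparing them with what the SearchQuerySet gives back.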

On the other hand, the SearchQuerySet returns just 3 projects, none of which involve this person, and the person's name is not similar to any of the project names.

>>> sqs = SearchQuerySet().filter(content__auto=q)
>>> sqs.count()
8
>>> sqs.models(Project)
[<SearchResult: project.project (pk=u'409')>, <SearchResult: project.project (pk=u'521')>, <SearchResult: project.project (pk=u'82')>]

Am I doing anything wrong here? Thanks for your feedback :-)


Solution

  • I have managed to get much better results by modifying my filter() query. I learned from the django-haystack documentation about the filter_or() method, which can be chained as many times as needed. This makes it easy to match against as many fields and/or indexes as needed. For example:

    sqs = SearchQuerySet().filter_or(content__contains=q).filter_or(employees__name__contains=q).filter_or(projects_name__contains=q)
    # etc.
    

    As mentioned, this has greatly improved the quality of the search results.
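
If the list of fields grows, the chained filter_or() calls can be built from a list of lookups instead of being written out by hand. The following is a minimal sketch, not part of Haystack: `filter_any` and `PROJECT_LOOKUPS` are hypothetical names, and the lookups assume the `employees` and `project_name` fields of the ProjectIndex shown in the question.

```python
from functools import reduce


def filter_any(sqs, query, lookups):
    """Chain sqs.filter_or(<lookup>=query) once per lookup, OR-ing them together."""
    return reduce(
        lambda current, lookup: current.filter_or(**{lookup: query}),
        lookups,
        sqs,
    )


# Lookups built from the ProjectIndex fields shown in the question (an assumption).
PROJECT_LOOKUPS = (
    "content__contains",
    "employees__contains",
    "project_name__contains",
)

# Usage inside a configured Django/Haystack project:
#   from haystack.query import SearchQuerySet
#   sqs = filter_any(SearchQuerySet(), q, PROJECT_LOOKUPS)
```

The helper only relies on the object having a filter_or() method that returns a new query set, so it behaves the same as writing the chain manually.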