I developed a small personal information directory that my client accesses and updates through a Django admin interface. That information needs to be searchable, so I set up my Django site to keep that data in a search index. I originally used Haystack and Whoosh for the search index, but I recently had to move away from those tools, and switched to Elasticsearch 5.
Previously, whenever anything in the directory was updated, the code simply cleared the entire search index and rebuilt it from scratch. There are only a few hundred entries in the directory, so that wasn't a serious performance problem. Unfortunately, doing the same thing in Elasticsearch has been very unreliable, due to what I presume is a race condition somewhere in my code.
Here's the code I wrote that uses elasticsearch-py and elasticsearch-dsl-py:
import elasticsearch
import time

from django.apps import apps
from django.conf import settings
from elasticsearch.helpers import bulk
from elasticsearch_dsl.connections import connections
from elasticsearch_dsl import DocType, Text, Search


# Create the default Elasticsearch connection using the host specified in settings.py.
elasticsearch_host = "{0}:{1}".format(
    settings.ELASTICSEARCH_HOST['HOST'], settings.ELASTICSEARCH_HOST['PORT']
)
elasticsearch_connection = connections.create_connection(hosts=[elasticsearch_host])


class DepartmentIndex(DocType):
    url = Text()
    name = Text()
    text = Text(analyzer='english')
    content_type = Text()

    class Meta:
        index = 'departmental_directory'


def refresh_index():
    # Erase the existing index.
    try:
        elasticsearch_connection.indices.delete(index=DepartmentIndex().meta.index)
    except elasticsearch.exceptions.NotFoundError:
        # If it doesn't exist, the job's already done.
        pass

    # Wait a few seconds to give enough time for Elasticsearch to accept that the
    # DepartmentIndex is gone before we try to recreate it.
    time.sleep(3)

    # Rebuild the index from scratch.
    DepartmentIndex.init()
    Department = apps.get_model('departmental_directory', 'Department')
    bulk(
        client=elasticsearch_connection,
        actions=(b.indexing() for b in Department.objects.all().iterator())
    )
I had set up Django signals to call refresh_index() whenever a Department was saved. But refresh_index() was frequently crashing due to this error:
elasticsearch.exceptions.RequestError: TransportError(400, u'index_already_exists_exception', u'index [departmental_directory/uOQdBukEQBWvMZk83eByug] already exists')
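For reference, the signal hookup is essentially the following (a sketch, since I haven't included my actual receiver; the function name is illustrative):

from django.db.models.signals import post_save
from django.dispatch import receiver

# Assumes refresh_index is importable from the module shown above.
@receiver(post_save, sender='departmental_directory.Department')
def rebuild_search_index_on_save(sender, instance, **kwargs):
    # Clear and rebuild the entire search index on every save.
    refresh_index()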
That error is why I added the time.sleep(3) call: I'm assuming the index hasn't been fully deleted by the time DepartmentIndex.init() is called, which is what causes the error.
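A more robust wait than a fixed sleep would presumably be to poll until Elasticsearch actually reports the index gone, along these lines (a sketch; wait_for_index_deletion is a name I made up):

import time

def wait_for_index_deletion(client, index_name, timeout=10):
    # Poll until the index no longer exists, or give up after `timeout` seconds.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not client.indices.exists(index=index_name):
            return
        time.sleep(0.5)
    raise RuntimeError('index {0!r} still exists after {1}s'.format(index_name, timeout))

But even with that, delete-then-recreate leaves a window where searches hit a missing or half-built index, so it doesn't feel like the right approach.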
My guess is that I've simply been going about this in entirely the wrong way. There's got to be a better way to keep an Elasticsearch index up to date using elasticsearch-dsl-py, but I just don't know what it is, and I haven't been able to figure it out from their docs.
Searching for "rebuild elasticsearch index from scratch" on Google gives loads of results for "how to reindex your elasticsearch data", but that's not what I want. I need to replace the indexed data with new, more up-to-date data from my app's database.
Maybe this will help: https://github.com/HonzaKral/es-django-example/blob/master/qa/models.py#L137-L146
Either way you want to have two mechanisms: a batch load of all your data into a new index (https://github.com/HonzaKral/es-django-example/blob/master/qa/management/commands/index_data.py) and, optionally, ongoing synchronization using methods or signals as mentioned above.
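One way to avoid the delete-then-recreate race entirely is to batch-load into a brand-new, uniquely named index and then atomically switch an alias over to it. A rough sketch of that pattern against your models (rebuild_via_alias and the timestamp suffix are mine; it assumes your indexing() method returns a plain action dict, and that departmental_directory is an alias your searches go through rather than a concrete index of that name):

import time

from django.apps import apps
from elasticsearch.helpers import bulk
from elasticsearch_dsl.connections import connections

ALIAS = 'departmental_directory'

def rebuild_via_alias():
    client = connections.get_connection()

    # Build the documents into a fresh, uniquely named index.
    new_index = '{0}-{1}'.format(ALIAS, int(time.time()))
    DepartmentIndex.init(index=new_index)

    Department = apps.get_model('departmental_directory', 'Department')
    bulk(
        client=client,
        actions=(
            dict(b.indexing(), _index=new_index)
            for b in Department.objects.all().iterator()
        ),
    )

    # Atomically repoint the alias: detach any old indices, attach the new one.
    actions = [{'add': {'index': new_index, 'alias': ALIAS}}]
    if client.indices.exists_alias(name=ALIAS):
        for old_index in client.indices.get_alias(name=ALIAS):
            actions.insert(0, {'remove': {'index': old_index, 'alias': ALIAS}})
    client.indices.update_aliases(body={'actions': actions})

Searches never see an empty or missing index this way; the old indices are left behind and can be deleted once the alias has moved.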