Tags: django, elasticsearch, celery, django-celery, elasticsearch-dsl-py

Elasticsearch Indexing in Django Celery Task


I’m building a Django web application to store documents and their associated metadata.

The bulk of the metadata will be stored in the underlying MySQL database, with the OCR’d document text indexed in Elasticsearch to enable full-text search. I’ve incorporated django-elasticsearch-dsl to connect and synchronize my data models, as I’m also indexing (and thus, double-storing) a few other fields found in my models. I had considered using Haystack, but it lacks support for the latest Elasticsearch versions.

When a document is uploaded via the application’s admin interface, a post_save signal automatically triggers an asynchronous Celery background task that performs the OCR and ultimately indexes the extracted text into Elasticsearch.

Since I don’t have a full-text field defined in my model (and hope to avoid adding one, as I don’t want to store or search against CLOBs in the database), I’m looking for the best practice for updating my Elasticsearch documents from my tasks.py file. There doesn’t seem to be a way to do so using django-elasticsearch-dsl (but maybe I’m wrong?), so I’m wondering whether I should either:

  1. Try to interface with Elasticsearch via REST using the sister django-elasticsearch-dsl-drf package.

  2. Integrate my application with Elasticsearch more loosely by using the more vanilla elasticsearch-dsl-py package (based on elasticsearch-py). I’d lose some “luxury” with this approach, as I’d have to write a bit more integration code, at least if I want to wire up my models with signals.

Is there a best practice? Or another approach I haven’t considered?
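
For reference, option 2 could look roughly like the sketch below: the task sends a partial update containing only the OCR'd text, so the text never touches the Django model. It assumes the elasticsearch-py 8.x client, whose update() accepts doc and doc_as_upsert keyword arguments (7.x clients express the same request as body={"doc": ..., "doc_as_upsert": True}). The helper index_ocr_text, the index name "submissions", and the RecordingClient stand-in are hypothetical names used only so the example runs without a cluster.

```python
from typing import Any


def index_ocr_text(client: Any, index: str, doc_id: str, text: str) -> Any:
    """Partially update one Elasticsearch document with OCR'd text.

    Only the "rawtext" field is sent; doc_as_upsert asks Elasticsearch to
    create the document if it has not been indexed yet.
    """
    return client.update(
        index=index,
        id=doc_id,
        doc={"rawtext": text},
        doc_as_upsert=True,
    )


class RecordingClient:
    """Hypothetical stand-in for elasticsearch.Elasticsearch.

    It only records the keyword arguments it receives.
    """

    def __init__(self) -> None:
        self.calls = []

    def update(self, **kwargs: Any) -> dict:
        self.calls.append(kwargs)
        return {"result": "updated"}


# In a real Celery task the client would be an Elasticsearch(...) instance.
fake = RecordingClient()
response = index_ocr_text(fake, "submissions", "42", "test")
```

One caveat with this approach: if django-elasticsearch-dsl later re-indexes the same model instance, the document is likely to be rebuilt from the model's prepared fields, so a field written only through a raw partial update may need to be carried over explicitly.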

Update 1: While implementing the answer from @Nielk, I'm able to persist the OCR'd text (result = "test" in tasks.py below) into Elasticsearch, but it is also being persisted in the MySQL database. I'm still unsure how to configure Submission.rawtext as a pass-through to Elasticsearch.

models.py:

class Submission(models.Model):

  rawtext = models.TextField(null=True, blank=True)
  ...
  def type_to_string(self):
    return ""

documents.py:

@registry.register_document
class SubmissionDocument(Document):

  rawtext = fields.TextField(attr="type_to_string")

  def prepare_rawtext(self, instance):
    # self.rawtext = None
    # instance.rawtext = "test"

    return instance.rawtext

  ... 

tasks.py (called on Submission model post_save signal):

  @shared_task
  def process_ocr(my_uuid):

    result = "test" # will ultimately be OCR'd text

    instance = Submission.objects.get(my_uuid=my_uuid)
    instance.rawtext = result
    instance.save()

Update 2 (Working Solution):

models.py:

class Submission(models.Model):

   @property
   def rawtext(self):
      if getattr(self, '_rawtext_local_change', False):
         return self._rawtext
      if not self.pk:
         return None
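      from .documents import SubmissionDocument  # local import avoids a circular import
      try:
         # read the field under the name defined on SubmissionDocument ("rawtext")
         return SubmissionDocument.get(id=self.pk).rawtext
      except Exception:
         # document not indexed yet, or Elasticsearch unreachable
         return None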

   @rawtext.setter
   def rawtext(self, value):
      self._rawtext_local_change = True
      self._rawtext = value

documents.py:

   @registry.register_document
   class SubmissionDocument(Document):

      rawtext = fields.TextField()

      def prepare_rawtext(self, instance):
         return instance.rawtext

tasks.py:

   @shared_task
   def process_ocr(my_uuid):

      result = "test" # will ultimately be OCR'd text

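      # rawtext is a property, not a database column, so QuerySet.update()
      # cannot set it; assign it on the instance and call save() so the
      # post_save signal re-indexes the document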
      instance = Submission.objects.get(my_uuid=my_uuid)
      instance.rawtext = result
      instance.save()
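
To make the read and write path of this property easier to follow, here is a minimal plain-Python sketch of the same pattern. It is not the Django code above: FAKE_INDEX is a hypothetical dictionary standing in for the Elasticsearch index, and this Submission class is a stand-in with no ORM, whose save() imitates the re-indexing that the post_save signal would trigger.

```python
# Hypothetical in-memory stand-in for the Elasticsearch index:
# {document id: {"rawtext": ...}}
FAKE_INDEX = {}


class Submission:
    """Plain-Python stand-in for the Django model (no ORM, no database)."""

    def __init__(self, pk=None):
        self.pk = pk

    @property
    def rawtext(self):
        # A value assigned on this instance wins until it has been saved.
        if getattr(self, "_rawtext_local_change", False):
            return self._rawtext
        if self.pk is None:
            return None
        # Otherwise read through to the index; a missing document yields None.
        return FAKE_INDEX.get(self.pk, {}).get("rawtext")

    @rawtext.setter
    def rawtext(self, value):
        self._rawtext_local_change = True
        self._rawtext = value

    def save(self):
        # Imitates save() followed by the signal handler that re-indexes the
        # document; the text is written to the index only, never to a table.
        FAKE_INDEX[self.pk] = {"rawtext": self.rawtext}


first = Submission(pk=1)
before = first.rawtext        # nothing indexed yet
first.rawtext = "test"
first.save()

reloaded = Submission(pk=1)   # fresh instance, as after a new database query
after = reloaded.rawtext      # read back from the index, not from the instance
```

The point of the design is that the model exposes rawtext like an ordinary attribute while the only durable copy of the text lives in the search index.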

Solution

  • You can add extra fields to the document definition linked to your model (see the field 'type_to_field' in the documentation: https://django-elasticsearch-dsl.readthedocs.io/en/latest/fields.html#using-different-attributes-for-model-fields ). Combine this with a 'prepare_xxx' method that initializes the field to an empty string when the instance is created and to its current value when the instance is updated. Would that solve your problem?

    Edit 1 - Here's what I meant:

    models.py

    class Submission(models.Model):
        @property
        def rawtext(self):
            if getattr(self, '_rawtext_local_change', False):
                return self._rawtext
            if not self.pk:
                return None
            from .documents import SubmissionDocument
            return SubmissionDocument.get(id=self.pk).rawtext
    
        @rawtext.setter
        def rawtext(self, value):
            self._rawtext_local_change = True
            self._rawtext = value
    

    Edit 2 - fixed code typo