Search code examples
pythonjsondjangopostgresqljsonfield

How do I add a prefetch (or similar) to a django queryset, using contents of a JSONField?


Background

I have a complex analytics application using conventional relational setup where different entities are updated using a CRUD model.

However, our system is quite event driven, and it's becoming a real misery to manage history, and apply data migrations respecting that history.

So we're moving toward a quasi-noSQL approach in which I'm using JSON to represent data with a linked list of patches representing the history.

As far as data structure goes it's a very elegant solution and will work well with other services. As far as django goes, I'm now struggling to use the ORM.

My database is PostgreSQL.

Situation

Previously I had something like (pseudocode):

class Environment(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)

class Analysis(models.Model):
    created = models.DateTime(auto_now_add=True)
    environment = models.ForeignKey(
        "myapp.Environment",
        blank=True,
        null=True,
        related_name="analyses",
    )
    everything_else = models.JSONField()

Whereas I now will have:

class Environment(models.Model):  
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)  
  
class Analysis(models.Model):  
    created = models.DateTime(auto_now_add=True)
    everything = models.JSONField()

Where everything might be a JSON object looking like...

{
  "environment": "2bf94e55-7b47-41ad-b81a-6cce59762160",
  "other_stuff": "lots of stuff"
}

Now, I can still totally get the Environment for a given Analysis because I have its ID, but if I do something like this (again pseudocode):

for analysis in Analysis.objects.filter(created=today).all():
   print(Environment.objects.get(id=analysis.everything['environment'])

I of course get an N+1 error because I'm querying the database for the environment on every iteration.

In a simple script I could put all the ids in a list then do a single query for them all using ids__in filtering... but that doesn't work well with the django ecosystem, which typically relies on a get_queryset method to customise what gets fetched... In particular this will enable the solution to work with

  • API serialisers
  • admin methods.

Question

Given a queryset like Analysis.objects.filter(created=today).all(), how can I adjust this queryset to pre-fetch the related environment objects, now that their IDs are in a JSONField?


Solution

  • Never use JSON fields unless you absolutely can justify it.

    JSON fields are a crutch that people lean on because they can't be bothered to design proper modeling.

    1. Denormalise your json data into separate tables.
    2. Use django history/some other built in/3rd party tools for historical tracking.

    Json fields will obfuscate your logic and make everything un-workable. It's the second time I'm working on a project where engineers though it was elegant or easy and ended up paying half a years worth of time or more to fix the spaghetti caused by these.

    As far as your prefetching stuff, you can use annotations & sub-queries, however anything you try to do with json fields will be expensive. Specifically because you can no longer take advantage of database operations such as joins/indexed lookups etc...

    So how would I proceed in your situation:

    If a problem is too complicated for the current use of JSON fields, but you still need to rely on them:

    Use a post_save signal to populate a table that you maintain, representative of a segment of the json field that you need.

    You can then use that table in your prefetch/ORM operation instead.