Search code examples
djangodjango-jsonfield

How do I extract all distint values for a specific key inside JSONField in Django running on Postgres?


Summary:

For a dropdown I need to compute the distinct values for a specific key stored inside a JSONField in a table in Postgres database. Worst case scenario: The table contains 1-10 million entries.

Background:

I'm developing a setup in which I have multiple deployments (one for each customer). Each deployment contains a backend and multiple clients. Logs are posted continuously from clients to the backend. These logs will contain a field log_meta, which includes a key named origin, that describes from where the log-entry came. From customer to customer the value of origin may vary, and I don't want to enforce a restricted set of values for origin, but in general they denote the environment in which the client is running; "DEV" and "PRODUCTION" are potential values for origin. In practice, there might be only 1-2 distinct values for origin in a single deployment. The number of logs can be expected to be in the range of 1-10 millions.

from jsonfield import JSONField
from django.db import models

class Log(models.Model)
    # Other fields
    log_json = JSONField(default=list)
    log_meta: JSONField(default=dict) # Will contain a key named origin

In an "administrator frontend" I want to support that the administrator can filter (via a dropdown) to see only logs coming from a specific origin. To do so, I need to extract the distinct values for the origin field.

How can I compute this set of distinct values in Django, taking into account that the number of logs may in some cases be in the 1-10 million range?

What I've tried already:

  • Nothing, since I don't know how it can be done.

Additional info:

  • The backend is written in Django, using Postgres as database.
  • If it is not feasible to compute the values on the fly, my alternative is to build the set of distinct values continuously as the logs roll in. I consider this a second options, since it introduces additional state; if possible I prefer to just compute/derive the set instead.

Solution

  • The QuerySet to perform required select:

    Log.objects.filter(
        # some filtering if required
        log_meta__origin__isnull=False
    ).order_by().values_list('log_meta__origin').distinct()
    

    order_by() is to clear any ordering already present on QuerySet to allow us make distinct() call later.


    Its 'effectiveness' is completely different and subjective question.

    PostgreSQL needs to look through all the records to perform this select.

    One possibility is to add indexing on just this one field of JSON (like in this SO question)

    As this type of select looks like it does not need to be performed frequently (meaning distinct origins are pretty stable, you can i.e. cache the list of distinct values and update it periodically) - use PostgreSQL Materialized Views and update them periodically / on demand (or plain simply store the list in cache (Redis) instead of Materialized Views).