Summary:
For a dropdown I need to compute the distinct values for a specific key stored inside a JSONField in a table in Postgres database. Worst case scenario: The table contains 1-10 million entries.
Background:
I'm developing a setup in which I have multiple deployments (one for each customer). Each deployment contains a backend and multiple clients. Logs are posted continuously from clients to the backend. These logs will contain a field log_meta
, which includes a key named origin
, that describes from where the log-entry came. From customer to customer the value of origin
may vary, and I don't want to enforce a restricted set of values for origin
, but in general they denote the environment in which the client is running; "DEV" and "PRODUCTION" are potential values for origin
. In practice, there might be only 1-2 distinct values for origin
in a single deployment. The number of logs can be expected to be in the range of 1-10 millions.
from jsonfield import JSONField
from django.db import models
class Log(models.Model)
# Other fields
log_json = JSONField(default=list)
log_meta: JSONField(default=dict) # Will contain a key named origin
In an "administrator frontend" I want to support that the administrator can filter (via a dropdown) to see only logs coming from a specific origin. To do so, I need to extract the distinct values for the origin
field.
How can I compute this set of distinct values in Django, taking into account that the number of logs may in some cases be in the 1-10 million range?
What I've tried already:
Additional info:
The QuerySet to perform required select:
Log.objects.filter(
# some filtering if required
log_meta__origin__isnull=False
).order_by().values_list('log_meta__origin').distinct()
order_by()
is to clear any ordering already present on QuerySet to allow us make distinct()
call later.
Its 'effectiveness' is completely different and subjective question.
PostgreSQL needs to look through all the records to perform this select.
One possibility is to add indexing on just this one field of JSON (like in this SO question)
As this type of select looks like it does not need to be performed frequently (meaning distinct origins are pretty stable, you can i.e. cache the list of distinct values and update it periodically) - use PostgreSQL Materialized Views and update them periodically / on demand (or plain simply store the list in cache (Redis) instead of Materialized Views).