Search code examples
sqldjangodjango-modelsdjango-postgresql

How to efficiently write DISTINCT query in Django with table having foreign keys


I want to show distinct cities of Users in the front end dropdown. For that, i make a db query which fetches distinct city_name from table City but only those cities where users are present.

Something like below works for a small size of User table, but takes a very long time if User table in of size 10 million. Distinct cities of these users are still ~100 though.

class City(models.Model):
    city_code = models.IntegerField(unique=True)
    city_name = models.CharField(max_length=256)

class User(models.Model):
    city = models.ForeignKey('City', to_field='city_code')

Now i try to search distinct city names as:

City.objects.filter().values_list('city__city_name').distinct()

which translates to this on PostgreSQL:

SELECT DISTINCT "city"."city_name" 
FROM "user" 
LEFT OUTER JOIN "city" 
                ON ("user"."city_id" = "city"."city_code");

Time: 9760.302 ms

That clearly showed that PostgreSQL was not making use of index on 'user'.'city_id'. I also read about a workaround solution here which involved writing a custom SQL query which somehow utilizes index.

I tried to find distinct 'user'.'city_id' using the above query, and that actually turned out to be pretty fast.

WITH 
    RECURSIVE t(n) AS 
                     (SELECT min(city_id) 
                      FROM user 
                      UNION 
                      SELECT 
                            (SELECT city_id 
                             FROM user 
                             WHERE city_id > n order by city_id limit 1) 
                      FROM t 
                      WHERE n is not null) 
                      SELECT n 
                      FROM t;

Time: 79.056 ms

But now i am finding it hard to incorporate this in my Django code. I still think it is a kind of hack adding custom query in the code for this. But a bigger concern for me is that the column name can be totally dynamic, and i can not hardcode these column names (eg. city_id, etc.) in the code.

#original_fields could be a list from input, like ['area_code__district_code__name']
dataset_klass.objects.filter().values_list(*original_fields).distinct()

Using the custom query would need atleast splitting the field name with '__' as delimiter and process the first part. But it looks like a bad hack to me.

How can i improve this?

PS. The City User example is just shown to explain the scenario. The syntax might not be correct.


Solution

  • I finally reached to this workaround solution.

    from django.db import connection, transaction
    
    original_field = 'city__city_name'
    dataset_name = 'user'
    dataset_klass = eval(camelize(dataset_name))
    
    split_arr = original_field.split("__",1)
    """If a foreign key relation is present
    """
    if len(split_arr) > 1:
        parent_field = dataset_klass._meta.get_field_by_name(split_arr[0])[0]
        cursor = connection.cursor()
        """This query will run fast only if parent_field is indexed (city_id)
        """
        cursor.execute('WITH RECURSIVE t(n) AS ( select min({0}) from {1} '
                       'union select (select {0} from {1} where {0} > n'
                       ' order by {0} limit 1) from t where n is not null) '
                       'select n from t;'.format(parent_field.get_attname_column()[1], dataset_name))
        """Create a list of all distinct city_id's"""
        distinct_values = [single[0] for single in cursor.fetchall()]
        """create a dict of foreign key field to the above list"""
        """to get the actual city_name's using _meta information"""
        filter_dict = {parent_field.rel.field_name+'__in':distinct_values}
        values = parent_field.rel.to.objects.filter(**filter_dict).values_list(split_arr[1])
    else:
        values = dataset_klass.objects.filter().values_list(original_field).distinct()
    

    Which utilizes the index on city_id in user table, runs pretty fast.