Search code examples
pythondjangopostgresqlfull-text-searchfull-text-indexing

Django full text search using indexes with PostgreSQL


After solving the problem I asked about in this question, I am trying to optimize performance of the FTS using indexes. I issued on my db the command:

CREATE INDEX my_table_idx ON my_table USING gin(to_tsvector('italian', very_important_field), to_tsvector('italian', also_important_field), to_tsvector('italian', not_so_important_field), to_tsvector('italian', not_important_field), to_tsvector('italian', tags));

Then I edited my model's Meta class as follows:

class MyEntry(models.Model):
    very_important_field = models.TextField(blank=True, null=True)
    also_important_field = models.TextField(blank=True, null=True)
    not_so_important_field = models.TextField(blank=True, null=True)
    not_important_field = models.TextField(blank=True, null=True)
    tags = models.TextField(blank=True, null=True)

    class Meta:
        managed = False
        db_table = 'my_table'
        indexes = [
            GinIndex(
                fields=['very_important_field', 'also_important_field', 'not_so_important_field', 'not_important_field', 'tags'],
                name='my_table_idx'
            )
        ]

But nothing seems to have changed. The lookup takes exactly the same amount of time as before.

This is the lookup script:

from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector

# other unrelated stuff here
vector = SearchVector("very_important_field", weight="A") + \
             SearchVector("tags", weight="A") + \
             SearchVector("also_important_field", weight="B") + \
             SearchVector("not_so_important_field", weight="C") + \
             SearchVector("not_important_field", weight="D")
query = SearchQuery(search_string, config="italian")
rank = SearchRank(vector, query, weights=[0.4, 0.6, 0.8, 1.0]). # D, C, B, A
full_text_search_qs = MyEntry.objects.annotate(rank=rank).filter(rank__gte=0.4).order_by("-rank")

What am I doing wrong?

Edit:

The above lookup is wrapped in a function I use a decorator on to time. The function actually returns a list, like this:

@timeit
def search(search_string):
    # the above code here
    qs = list(full_text_search_qs)
    return qs

Might this be the problem, maybe?


Solution

  • You need to add a SearchVectorField to your MyEntry, update it from your actual text fields and then perform the search on this field. However, the update can only be performed after the record has been saved to the database.

    Essentially:

    from django.contrib.postgres.indexes import GinIndex
    from django.contrib.postgres.search import SearchVector, SearchVectorField
    
    class MyEntry(models.Model):
        # The fields that contain the raw data.
        very_important_field = models.TextField(blank=True, null=True)
        also_important_field = models.TextField(blank=True, null=True)
        not_so_important_field = models.TextField(blank=True, null=True)
        not_important_field = models.TextField(blank=True, null=True)
        tags = models.TextField(blank=True, null=True)
    
        # The field we actually going to search.
        # Must be null=True because we cannot set it immediately during create()
        search_vector = SearchVectorField(editable=False, null=True)  
    
        class Meta:
            # The search index pointing to our actual search field.
            indexes = [GinIndex(fields=["search_vector"])]
    

    Then you can create the plain instance as usual, for example:

    # Does not set MyEntry.search_vector yet.
    my_entry = MyEntry.objects.create(
        very_important_field="something very important",  # Fake Italien text ;-)
        also_important_field="something different but equally important"
        not_so_important_field="this one matters less"
        not_important_field="we don't care are about that one at all"
        tags="things, stuff, whatever"
    

    Now that the entry exists in the database, you can update the search_vector field using all kinds of options. For example weight to specify the importance and config to use one of the default language configurations. You can also completely omit fields you don't want to search:

    # Update search vector on existing database record.
    my_entry.search_vector = (
        SearchVector("very_important_field", weight="A", config="italien")
        + SearchVector("also_important_field", weight="A", config="italien")
        + SearchVector("not_so_important_field", weight="C", config="italien")
        + SearchVector("tags", weight="B", config="italien")
    )
    my_entry.save()
    

    Manually updating the search_vector field every time some of the text fields change can be error prone, so you might consider adding an SQL trigger to do that for you using a Django migration. For an example on how to do that see for instance a blog article on Full-text Search with Django and PostgreSQL.

    To actually search in MyEntry using the index you need to filter and rank by your search_vector field. The config for the SearchQuery should match the one of the SearchVector above (to use the same stopword, stemming etc).

    For example:

    from django.contrib.postgres.search import SearchQuery, SearchRank
    from django.core.exceptions import ValidationError
    from django.db.models import F, QuerySet
    
    search_query = SearchQuery("important", search_type="websearch", config="italien")
    search_rank = SearchRank(F("search_vector"), search_query)
    my_entries_found = (
        MyEntry.objects.annotate(rank=search_rank)
        .filter(search_vector=search_query)  # Perform full text search on index.
        .order_by("-rank")  # Yield most relevant entries first.
    )