Search code examples
djangoredisartificial-intelligencerankingrelevance

The best solution to retrieve the most relevant outputs to the each user (in Django or any Backend)?


I'm looking for the best solution to retrieve the most relevant outputs to the each user.

I simplified my models as UserProfile and Groups like below

-Model Name: UserProfile
styles: ['a', 'b', 'f', 'r'] <- ('styles' are field name)

-Group 1
styles: ['a', 'f']

-Group 2
['g', 'a', 'h']

 ...

-Group 1,000,000
styles: ['s', 'w', 'x']
(Let's say we have millions of Groups)

I want to sort and retrieve groups based on the user's styles. So in this case, 'Group 1' scores 2 because of the styles 'a', 'f', and 'Group 2' scores 1 because of the style 'a'.

We can't store the scores in our main database because each user has different styles.

  • My Approach 1: rank all database every time when user requests (I wrote a code conceptually)

views.py

for group in Group.objects.all():
    # store the score to the new field of the group
    group.style_count = group.styles.join_count(user.styles)
list_view_output = Group.objects.order_by(style_count)
  • Approach 2: Store the rank on database Execute the query and store the outputs (with rank of course and user id) in Redis in-memory cache database. And retrieve results when specific user wants to

Problems in mind:

  1. The query seems quite expensive. O(n) for iterating * O( min( user.style.count(), group.style.count() ) ) for joining. How can I do better? Maybe I can do something in Model?
  2. Unfortunately if we have a million groups and 1,000 users, I need to store a billion rows in cache memory (Redis). And I can definitely not afford it (I think I can have maximum 8GB,, or maybe more)
  3. Maybe I won't need to store every users' rank data in cache because some users have same styles. Do you know any AI approach on this?

Also could you provide any advice to build this better?

Thank you...!!!!!


Solution

  • The bottom line is that, for problems in the scale of millions, saving everything and sorting it may not be a good idea.

    Using the current data structure, - time complexity: O(n); almost impossible reduce that - space complexity: we can improve it a lot. e.g. only top 10 is needed, the cost can be O(1). We can only save the top 10 in a sorted manner. In the linear scan, we only compare the new group's score to the smallest one and replace it if the new score is bigger

    But a possible issue is that if you have too many groups with the same score, then it could be a problem. You need to handle it carefully.

    If you organize the style and groups in a 'sorted' way, maybe it can be faster. For example, in alphabetical order, Group 2 will be ['a', 'g', 'h']. And we keep track of the index of first one or several letters. For example, Groups who start with 'a' will be the first batch; Then Groups who start with 'b', and so on. In your case, you can do the searching in this manner: 1. first the groups starting with 'a'; 2. then search the groups starting with 'b'; 3. then 'f'; 4. then 'r'. So the groups who doesn't have any of the styles in [a, b, f, r] are not touched. In this way, it could save you a lot of time.