I'm looking for the best solution to retrieve the most relevant outputs to the each user.
I simplified my models as UserProfile and Groups like below
-Model Name: UserProfile
styles: ['a', 'b', 'f', 'r'] <- ('styles' are field name)
-Group 1
styles: ['a', 'f']
-Group 2
['g', 'a', 'h']
...
-Group 1,000,000
styles: ['s', 'w', 'x']
(Let's say we have millions of Groups)
I want to sort and retrieve groups based on the user's styles. So in this case, 'Group 1' scores 2 because of the styles 'a', 'f', and 'Group 2' scores 1 because of the style 'a'.
We can't store the scores in our main database because each user has different styles.
views.py
for group in Group.objects.all():
# store the score to the new field of the group
group.style_count = group.styles.join_count(user.styles)
list_view_output = Group.objects.order_by(style_count)
Problems in mind:
Also could you provide any advice to build this better?
Thank you...!!!!!
The bottom line is that, for problems in the scale of millions, saving everything and sorting it may not be a good idea.
Using the current data structure, - time complexity: O(n); almost impossible reduce that - space complexity: we can improve it a lot. e.g. only top 10 is needed, the cost can be O(1). We can only save the top 10 in a sorted manner. In the linear scan, we only compare the new group's score to the smallest one and replace it if the new score is bigger
But a possible issue is that if you have too many groups with the same score, then it could be a problem. You need to handle it carefully.
If you organize the style and groups in a 'sorted' way, maybe it can be faster. For example, in alphabetical order, Group 2 will be ['a', 'g', 'h']. And we keep track of the index of first one or several letters. For example, Groups who start with 'a' will be the first batch; Then Groups who start with 'b', and so on. In your case, you can do the searching in this manner: 1. first the groups starting with 'a'; 2. then search the groups starting with 'b'; 3. then 'f'; 4. then 'r'. So the groups who doesn't have any of the styles in [a, b, f, r] are not touched. In this way, it could save you a lot of time.