Hello I made a function that will return posts based on when they were created and ranked by the number of votes they have. This is for a feed. I'm not sure that if this is an efficient way to do this and it's important that I get it right because this could potentially be sifting through thousands of posts. I have simplified it some here and briefly explained each step. My current concerns and a post of the whole function (without annotations) are after.
The start hours argument is for how many hours ago to get posts. For example, if startHours=6, than only posts created after six hours ago will be returned.
def rank(request, startHours):
First I order all of the posts by votes
unorderedPosts=Post.objects.order_by('-votes')
Then posts are excluded by categories the user has specified
if request.user.is_authenticated():
preferences=request.user.categorypreference_set.filter(on=False)
for preference in preferences:
unorderedPosts=unorderedPosts.exclude(category_name=preference.category_name)
Then I make endHours, which is always 4 hours ahead of startHours argument
endHours=startHours+4 #4 hour time window
Now I slice the unorderedPosts and get only the ones created in the time window startHours to endHours from now. For example, if startHours=4, than only posts created after 4 hours ago but before 8 hours ago will be returned.
posts=unorderedPosts.filter(created__gte=(timezone.now()-datetime.timedelta(hours=endHours)))
posts=posts.exclude(created__gte=(timezone.now()-datetime.timedelta(hours=startHours)))
Now I make a loop that will move the time window back until at least one post is found that has a creation date that fits the time window. I make the check variable to prevent an infinite loop (the loop will quit if no posts are found within 200 hours).
count=posts.count()
check=endHours
while count<1 and endHours<(check+200):
endHours+=4
startHours+=4
posts=unorderedPosts.filter(created__gte=(timezone.now()-datetime.timedelta(hours=endHours)))
posts=posts.exclude(created__gte=(timezone.now()-datetime.timedelta(hours=startHours)))
count=posts.count()
if count>=1: return posts, endHours
return posts
My biggest concern is making a queryset of ALL the posts in the beginning. This function is meant to return posts in small time windows, will it be unnecessarily slowed down by ranking all of the posts in the database? I know that django/python querysets are quite efficient but won't ranking a set that contains thousands of objects be cumbersome for the purpose of this function?
If this is a problem, how could I make it more efficient while keeping everything accessible?
Here is a post of the whole thing.
def rank(request, startHours):
unorderedPosts=Post.objects.order_by('-upVote')
if request.user.is_authenticated():
preferences=request.user.categorypreference_set.filter(on=False)
for preference in preferences: #filter by category preference
unorderedPosts=unorderedPosts.exclude(category_name=preference.category_name)
endHours=startHours+4 #4 hour time window
posts=unorderedPosts.filter(created__gte=(timezone.now()-datetime.timedelta(hours=endHours)))
posts=posts.exclude(created__gte=(timezone.now()-datetime.timedelta(hours=startHours)))
count=posts.count()
check=endHours
while count<1 and endHours<(check+200):
endHours+=4
startHours+=4
posts=unorderedPosts.filter(created__gte=(timezone.now()-datetime.timedelta(hours=endHours)))
posts=posts.exclude(created__gte=(timezone.now()-datetime.timedelta(hours=startHours)))
count=posts.count()
if count>=1: return posts
return posts
Your main concern is not something you need to worry about. Check the docs on When querysets are evaluated - you can define and add clauses to a queryset indefinitely and it won't actually be run against the database until you call something that actually requires hitting the database.
What will require multiple queries is iterating through time until you hit a window that has posts. You'll have better performance if you check the latest created
time in one call, use that to work out your window, then limit the posts based on that and then sort by number of votes.
Something like:
unorderedPosts = Post.objects.all()
if request.user.is_authenticated():
preferences=request.user.categorypreference_set.filter(on=False)
for preference in preferences: #filter by category preference
unorderedPosts=unorderedPosts.exclude(category_name=preference.category_name)
latest_post_datetime = unorderedPosts.aggregate(Max('created'))['created__max']
original_start_time = datetime.datetime.now() - datetime.timedelta(hours=startHours)
latest_post_day_start_time = datetime.datetime.combine(latest_post_datetime.date(), original_start_time.time())
# a timedelta guaranteed to be less than 24 hours
time_shift = latest_post_day_start_time - latest_post_datetime
timewindow = datetime.timedelta(hours=4)
if time_shift.days >= 0:
extra_windows_needed = time_shift.seconds / timewindow.seconds
else:
# negative timedeltas store negative days, then positive seconds; negate
extra_windows_needed = -(abs(time_shift).seconds) / timewindow.seconds
start_time = latest_post_day_start_time - (timewindow * (extra_windows_needed + 1))
posts = unorderedPosts.filter(created__gte=start_time).order_by('-upVote')
return posts
The math here is only right as long as your number of hours in your window (4) divides evenly into the day - otherwise calculating the correct offset gets trickier. Basically, you need to take time offset mod time window length, and I'm exploiting the fact that if you end up in the same calendar day I know the days mod four hours part works out.
Also, it doesn't include an end time because your original logic doesn't enforce one for the initial startHours
period. It'll only move the start time back out of that if there are none within it, so you don't need to worry about stuff that's too recent showing up.
This version makes at most three DB queries - one to get the category preferences for a logged in user, one to get latest_post_datetime
and one to get posts
, with confidence of having at least one matching post.
You might also consider profiling to see if your DB back end does better with a subquery to exclude unwanted categories:
if request.user.is_authenticated():
unorderedPosts = unorderedPosts.exclude(category_name__in=request.user.categorypreference_set.filter(on=False).values_list('category_name')
As the __in lookup docs note, performance here varies with different database back ends.