python google-app-engine google-bigquery google-cloud-datastore google-prediction

Computing an index that accounts for score and date within Google App Engine Datastore


I'm working on a Google App Engine (Python) based site that allows for user-generated content and voting (like/dislike) on that content.

Our designer has, rather nebulously, spec'd that the front page should be a balance between recent content and popular content, probably assuming that we would just compute a score value that weights likes/dislikes against time since creation. Ultimately, the goals are (1) bad content gets filtered out fairly quickly, (2) content that continues to be popular stays up longer, and (3) new content has a chance of staying up long enough to collect the votes needed to determine whether it's good or bad.

I can easily compute a score based on likes/dislikes. But incorporating the time factor to produce a single score that can be indexed doesn't seem feasible. I would essentially need to reindex all the content every day to adjust its score, which seems cost-prohibitive once we have any sizable amount of content. So, I'm at a loss for potential solutions.

I've also suggested time-boxing it (all time, daily, weekly), but he says users are unlikely to look at tabs other than the default view. Also, if I filtered on the last week, I'd need to sort on time first, and the secondary popularity sort would then be essentially meaningless, since submission times would be virtually unique.

Any suggestions on solutions that I might be overlooking?

Would something like Google's Prediction API or BigQuery be able to handle this better?


Solution

  • Such a system is often called "frecency", and there are a number of ways to do it. One way is to have votes 'decay' over time. I've implemented this in the past on App Engine by storing a current score and a last-updated timestamp. Each vote first applies an exponential decay to the stored score, based on the time elapsed since last-updated, then adds the vote and stores both values. A background process runs a few times a day to update the score and last-updated time of any posts that haven't received votes in a while. Thus, a post's score always tends towards 0 unless it consistently receives upvotes.

    Another, even simpler, system is to give posts serial numbers. Each new post gets the next number in sequence, and whenever someone upvotes a post, its number is incremented. Thus, the natural ordering is by creation order, but votes serve to 'reshuffle' things, putting more upvoted posts ahead of newer but less voted posts.
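
The two approaches above can be sketched in plain Python. This is only an illustration, not the answerer's actual code: the `Post` class, the function names, and the one-day `HALF_LIFE` are all assumptions, and Datastore reads and writes are left out. In a real app, `score` and `serial` would be indexed properties and the front page would sort on one of them in descending order.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed half-life: a score halves for every day without votes. Tune per site.
HALF_LIFE = timedelta(days=1)


@dataclass
class Post:
    """Hypothetical stand-in for a Datastore entity."""
    last_updated: datetime   # when `score` was last recomputed
    score: float = 0.0       # decayed vote score (approach 1)
    serial: int = 0          # serial number assigned at creation (approach 2)


def decayed_score(score, last_updated, now, half_life=HALF_LIFE):
    """Return `score` exponentially decayed over the time since `last_updated`."""
    elapsed = (now - last_updated).total_seconds()
    return score * 0.5 ** (elapsed / half_life.total_seconds())


def apply_vote(post, delta, now):
    """Approach 1: decay the stored score to `now`, add the vote, store both."""
    post.score = decayed_score(post.score, post.last_updated, now) + delta
    post.last_updated = now


def refresh(post, now):
    """What the background job would do for posts with no recent votes."""
    apply_vote(post, 0.0, now)


def bump_serial(post):
    """Approach 2: an upvote moves the post one step ahead in creation order."""
    post.serial += 1
```

With the decay approach, a stored score is only as fresh as its `last_updated` time, so between background runs a quiet post ranks somewhat higher than its true decayed score. How often to run `refresh` is a trade-off between that staleness and write cost.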