Mahout Recommender: What relative preference values are suitable for a GenericUserBasedRecommender?


In Mahout, I'm setting up a GenericUserBasedRecommender. It's pretty straightforward for now, with typical settings.

In generating a "preference" value for an item, we have the following 5 data points:

Positive interest

  • User converted on item (highest possible sign of interest)
  • Normal like (user expressed interest, e.g. like buttons)
  • Indirect expression of interest (clicks, cursor movements, measuring "eyeballs")

Negative interest

  • Indifference (items the user ignored when active on other items, a vague expression of disinterest)
  • Active dislike (thumbs down, remove item from my view, etc)

Over what range should I express these different attributes? Let's use a 1-100 scale for discussion.

  • Should I be keeping the 'Active dislike' and 'Indifference' clustered close together, for example, at 1 and 5 respectively, with all the likes clustered in the 90-100 range?
  • Should 'Indifference' and 'Indirect expressions of interest' be closer to the center? As in 'Indifference' in the 20-35 range and 'Indirect like' in the 60-70 range?
  • Should 'User conversion' blow the scale away and be head and shoulders above the others? As in: 'User conversion' @ 100, 'Lesser likes' @ ~65, 'Dislikes' clustered in the 1-10 range?
  • On a scale of 1-100, is 50 effectively "null", or equivalent to no data point at all?

I know the final answer lies in trial and error and in the meaning of our data, but as far as the algorithm goes, I'm trying to understand at what point I need to tip the scales between interest and disinterest for the algorithm to function properly.


Solution

  • The actual range does not matter for this implementation. 1-100 is OK, 0-1 is OK, etc. The relative values are all that really matter here.

    Preference estimates are computed as a simple (linear) weighted average, so your scoring ought to be "linear" too. It ought to match the intuition that if action X gets a score twice as high as action Y, then X should indicate twice as much interest in real life.

    A decent place to start is to simply size them relative to their frequency. If the click-to-conversion rate is 2%, you might make a click worth 2% of a conversion.

    I would ignore the "Indifference" signal you propose. It is likely going to be too noisy to be of use.
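
As a rough sketch of the frequency-based sizing suggested above, the plain Java below derives relative preference values from how often each positive signal occurs, then shows a simplified similarity-weighted average to illustrate why only the relative values matter. The class and method names (`PreferenceWeights`, `fromFrequencies`, `weightedAverage`) and the event counts are hypothetical, invented for illustration. The averaging method is a simplification and not Mahout's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout.
class PreferenceWeights {

    /**
     * Weights each signal inversely to how often it occurs, anchored so the
     * rarest signal receives maxValue. All counts must be positive.
     */
    static Map<String, Double> fromFrequencies(Map<String, Long> counts, double maxValue) {
        long rarest = Long.MAX_VALUE;
        for (long count : counts.values()) {
            if (count <= 0) {
                throw new IllegalArgumentException("counts must be positive");
            }
            rarest = Math.min(rarest, count);
        }
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            weights.put(e.getKey(), maxValue * rarest / e.getValue());
        }
        return weights;
    }

    /**
     * Simplified similarity-weighted average of neighbours' preferences.
     * Assumes both arrays have the same length and similarities sum to > 0.
     */
    static double weightedAverage(double[] preferences, double[] similarities) {
        double weightedSum = 0.0;
        double totalSimilarity = 0.0;
        for (int i = 0; i < preferences.length; i++) {
            weightedSum += similarities[i] * preferences[i];
            totalSimilarity += similarities[i];
        }
        return weightedSum / totalSimilarity;
    }

    public static void main(String[] args) {
        // Made-up counts: 200 conversions out of 10,000 clicks is a 2% rate.
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("conversion", 200L);
        counts.put("like", 1_000L);
        counts.put("click", 10_000L);

        System.out.println(fromFrequencies(counts, 100.0));
        // prints {conversion=100.0, like=20.0, click=2.0}

        // Two equally similar neighbours: one converted (100), one liked (20).
        System.out.println(weightedAverage(new double[] {100, 20}, new double[] {0.5, 0.5}));
        // prints 60.0
    }
}
```

Because the estimate is linear in the inputs, multiplying every preference value by the same constant multiplies the estimate by that constant too, so the ranking of items is unchanged whether you use 1-100 or 0-1. The sketch covers only the positive signals; how to score 'Active dislike' is left open here.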