machine-learning reinforcement-learning vowpalwabbit recommendation-engine

How to learn to rank using Vowpal Wabbit's contextual bandit?


I am using Vowpal Wabbit's contextual bandit to rank various actions given a context.

Train Data:
"1:10:0.1 | 123"
"2:9:0.1 | 123"
"3:8:0.1 | 123"
"4:7:0.1 | 123"
"5:6:0.1 | 123"
"6:5:0.1 | 123"
"7:4:0.1 | 123"

Test Data:
" | 123"

Now, the expected ranking of actions should be (from least loss to most loss):

7 6 5 4 3 2 1

Using --cb just returns the single best action:

7

And using --cb_explore returns a PMF over the actions to be explored, but that doesn't seem to help with ranking.

[0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.9571428298950195]

Is there any other way of using vw's contextual bandit for ranking?


Solution

  • Olga's response on the repo: https://github.com/VowpalWabbit/vowpal_wabbit/issues/2555

    --cb does not do any exploration and just trains the model on the given input, so the output will be whatever the model (as trained so far) predicts.

    --cb_explore includes exploration using epsilon-greedy by default if nothing else is specified. You can take a look at all the available exploration methods here

    cb_explore's output is the PMF given by the exploration strategy (see here for more info).

    Epsilon-greedy will choose, with probability e, an action at random from a uniform distribution (exploration), and with probability 1-e epsilon-greedy will use the so-far trained model to predict the best action (exploitation).

    So the output will be the pmf over the actions (prob. 1-e OR e for the chosen action) and then the remaining probability will be equally split between the remaining actions. Therefore cb_explore will not provide you with a ranking.
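To make the numbers concrete, here is a minimal sketch of that epsilon-greedy PMF. It assumes VW's default epsilon of 0.05 and that every action receives e/k, with the greedy action getting the remaining 1-e on top. The function name `epsilon_greedy_pmf` is hypothetical and not part of VW.

```python
def epsilon_greedy_pmf(num_actions, best_action, epsilon=0.05):
    """Return an epsilon-greedy PMF over `num_actions` actions.

    Every action gets epsilon / num_actions (uniform exploration);
    the greedy action additionally gets 1 - epsilon (exploitation).
    `best_action` is a 0-based index into the returned list.
    """
    pmf = [epsilon / num_actions] * num_actions
    pmf[best_action] += 1.0 - epsilon
    return pmf


# Seven actions, with the last one (action 7 in the question) being greedy.
pmf = epsilon_greedy_pmf(num_actions=7, best_action=6)
print(pmf)
```

With seven actions this gives roughly 0.00714 for each non-greedy action and roughly 0.95714 for the greedy one, which is the shape of the output shown in the question. All non-greedy actions tie, so the PMF values alone cannot order them.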

    One option for ranking would be to use CCB. Then you get a ranking and can provide feedback on any slot, but it is more computationally expensive. CCB runs CB for each slot, but the effect is a ranking since each slot draws from the overall pool of actions.
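For reference, a rough sketch of what a CCB example might look like for the data in the question. It assumes the multi-line `ccb shared` / `ccb action` / `ccb slot` text format as I understand it from the VW wiki, with 0-based action indices in slot labels. The helper `build_ccb_example` and the action features `a1`..`a7` are hypothetical, so check the format against the wiki before relying on it.

```python
def build_ccb_example(shared_features, action_features, slot_labels):
    """Build one multi-line CCB example as text.

    slot_labels has one entry per slot: None for an unlabeled slot
    (prediction), or a (chosen_action_index, cost, probability) tuple
    for a labeled slot (training). Action indices are assumed 0-based.
    """
    lines = [f"ccb shared | {shared_features}"]
    lines += [f"ccb action | {features}" for features in action_features]
    for label in slot_labels:
        if label is None:
            lines.append("ccb slot |")
        else:
            action, cost, prob = label
            lines.append(f"ccb slot {action}:{cost}:{prob} |")
    return "\n".join(lines)


# Hypothetical action features; the question only gives the context "123".
actions = [f"a{i}" for i in range(1, 8)]

# Unlabeled example asking for a ranking over three slots.
print(build_ccb_example("123", actions, [None, None, None]))
```

Examples like this would be fed to vw with `--ccb_explore_adf`. Each slot's prediction then picks from the actions not already placed in earlier slots, which is what yields the ranking.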

    And my follow-up:

    I think CCB is a good option if computational limits allow. I'd just like to add that if you do cb_explore or cb_explore_adf then the resulting PMF should be sorted by score so it is a ranking of sorts. However, it's worth verifying that the ordering is in fact sorted by scores (--audit will help here) as I don't know if there is a test covering this.