python-3.x machine-learning vowpalwabbit bandit-python

What is Vowpal Wabbit’s default learner for CMAB Framework?

I’m reading Vowpal Wabbit’s documentation to understand how it actually learns. A traditional contextual bandit learns a function F(context, action) = reward, finds the action that maximizes the predicted reward, and returns that action as the recommendation. F can be any model (linear, neural net, XGBoost, etc.) and is learned through batch processing, i.e. collect 100 contexts, 100 actions, and 100 rewards, train the ML model, then repeat.

Now, VW says it reduces “all contextual bandit problems to cost-sensitive multiclass classification problems.” OK, I read up on that, but there still needs to be some function F that is fit to solve this problem, doesn’t there?

I’ve thoroughly read the documentation and either:

Missed what the default learner is for batch processing or,
Don’t understand how VW is actually learning in this cost-sensitive framework?

I’ve even scoured the vw.learn() method inside pyvwlib. Thanks for the help!

Solution

> Missed what the default learner is for batch processing or,

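The default learner in VW is SGD on a linear representation, with squared loss unless another loss is selected. It learns online, updating the weights one example at a time rather than retraining on batches. All of this can be changed using command-line arguments.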
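To make "SGD on a linear representation" concrete, here is a minimal sketch in plain Python. It is not VW's implementation: the function name `sgd_update`, the fixed learning rate, and the toy data are assumptions chosen for illustration, and VW's real update adds things like feature hashing and per-feature learning rates.

```python
def sgd_update(weights, features, label, learning_rate=0.1):
    """One online SGD step for squared loss on a linear model.

    `weights` and `features` are dicts mapping feature name -> float.
    Hypothetical helper for illustration; not part of VW's API.
    """
    prediction = sum(weights.get(name, 0.0) * value
                     for name, value in features.items())
    error = prediction - label
    # Gradient of 0.5 * (prediction - label)^2 w.r.t. each weight is error * value.
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) - learning_rate * error * value
    return prediction


# Toy stream generated by label = 2 * x + 1 (made-up data).
stream = [
    ({"bias": 1.0, "x": 1.0}, 3.0),
    ({"bias": 1.0, "x": 2.0}, 5.0),
    ({"bias": 1.0, "x": 0.0}, 1.0),
]

weights = {}
for _ in range(200):                 # several passes over the toy stream
    for features, label in stream:   # one example at a time, no batch refit
        sgd_update(weights, features, label)
```

The point of the sketch is the shape of the loop: the model is updated after every example, so there is no separate "collect 100 examples, then train" step.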

> Don’t understand how VW is actually learning in this cost-sensitive framework?

In contextual bandit learning, only the reward associated with the action that was taken is presented for learning. VW in ips (inverse propensity score) mode converts this into a reward for every action by putting zeros at the actions not taken and importance-weighting the reward for the action taken, i.e. dividing it by the probability with which that action was chosen. Having imputed the missing data, it then treats the problem as a supervised (cost-sensitive) learning problem.
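The imputation step described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not VW's code: `ips_cost_vector` is a hypothetical helper, and it works with costs rather than rewards because VW's contextual bandit labels are written as action:cost:probability.

```python
def ips_cost_vector(num_actions, chosen_action, observed_cost, probability):
    """Impute a full cost vector from one logged bandit example using IPS.

    Actions not taken get 0; the taken action gets cost / probability.
    Hypothetical helper for illustration; not part of VW's API.
    """
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    costs = [0.0] * num_actions
    costs[chosen_action] = observed_cost / probability
    return costs


# Logged example: action 1 of 3 was chosen with probability 0.25, observed cost 0.5.
costs = ips_cost_vector(3, chosen_action=1, observed_cost=0.5, probability=0.25)
print(costs)  # [0.0, 2.0, 0.0]
```

Dividing by the probability is what makes the imputed vector unbiased in expectation: an action that is rarely chosen contributes a larger value on the occasions it is chosen. The resulting per-action costs are what the supervised learner is then trained on.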