I have two quick questions about the use of contextual bandit mode in Vowpal Wabbit.
1) Does --cb mode output a deterministic policy, i.e. one that greedily chooses the best action learned from a given training dataset? In that case the probability of choosing that action would be 1, and 0 for all the others.
2) What theoretical background underlies the policy-learning process of --cb_explore? I know the policy-learning process of --cb comes from https://arxiv.org/pdf/1103.4601.pdf. Does --cb_explore use the same process? Since --cb_explore essentially learns a non-stationary policy, I think it should use a different one.
--cb_explore also supports --epsilon <epsilon> (the epsilon-greedy exploration algorithm): during learning, a portion of predictions is spent exploring the action space at random, as opposed to pure greedy exploitation of what is already known.

Note: vowpalwabbit.org is an excellent resource for further information on contextual bandits in vw.
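The epsilon-greedy scheme mentioned above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not VW's actual implementation; note that VW's cb modes minimize *cost* rather than maximize reward, so the greedy choice is the lowest-cost action:

```python
import random

def epsilon_greedy(estimated_costs, epsilon, rng=None):
    """Choose an action index under epsilon-greedy:
    with probability epsilon pick uniformly at random (explore),
    otherwise pick the action with the lowest estimated cost (exploit)."""
    rng = rng or random.Random(0)
    if rng.random() < epsilon:
        return rng.randrange(len(estimated_costs))          # explore
    return min(range(len(estimated_costs)),
               key=lambda a: estimated_costs[a])            # exploit (greedy)

def action_probabilities(estimated_costs, epsilon):
    """Probability of each action being chosen under epsilon-greedy:
    epsilon/k for every one of the k actions, plus (1 - epsilon)
    extra mass on the greedy (lowest-cost) action."""
    k = len(estimated_costs)
    best = min(range(k), key=lambda a: estimated_costs[a])
    probs = [epsilon / k] * k
    probs[best] += 1.0 - epsilon
    return probs
```

With epsilon = 0 this collapses to the deterministic greedy policy asked about in question 1; with epsilon > 0 every action keeps nonzero probability, which is what makes the logged probabilities usable for later off-policy learning.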
Vowpal Wabbit supports three (3) contextual bandit base algorithms:

--cb
The contextual bandit module, which lets you optimize a predictor based on already-collected data; in other words, contextual bandits without exploration.

--cb_explore
The contextual bandit learning algorithm for when the maximum number of actions is known ahead of time and the semantics of the actions stay the same across examples.

--cb_explore_adf
The contextual bandit learning algorithm for when the set of actions changes over time or you have rich information for each action.

Vowpal Wabbit offers different input formats for contextual bandits. When exploration is in effect, Vowpal Wabbit offers five (5) exploration algorithms:
--first
--epsilon
--bag
--cover
--softmax
(only supported for --cb_explore_adf)

Working examples for every option can be found in the source tree in the file tests/RunTests; scroll down to the __DATA__ section to find many command examples.
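To make the --cb input format concrete: each example carries a label of the form action:cost:probability (the probability being that with which the logging policy chose the action when the data was collected), followed by the features. Here is a minimal sketch; the helper name cb_example is hypothetical, only the emitted line format follows VW's documented cb format:

```python
def cb_example(action, cost, probability, features):
    # One line of VW's --cb training format:
    #   <action>:<cost>:<probability> | <feature> <feature> ...
    # <probability> is the probability with which the logging policy
    # chose <action> when this data point was collected.
    return f"{action}:{cost}:{probability} | " + " ".join(features)

line = cb_example(2, 0.5, 0.25, ["user_age=25", "hour=14"])
# line == "2:0.5:0.25 | user_age=25 hour=14"
```

A file of such lines could then be trained on (assuming vw is installed and, say, 4 possible actions) with something like `vw --cb 4 -d train.dat`.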