I have two quick questions about the use of contextual bandit mode in Vowpal Wabbit.
1) Does --cb mode output a deterministic policy, i.e. one that greedily chooses the best action learned from a given training dataset? In that case the probability of choosing that action would be 1, and 0 for all the others.
2) What theoretical background underlies the policy-learning process of --cb_explore? I know the policy-learning process of --cb comes from https://arxiv.org/pdf/1103.4601.pdf. Does --cb_explore use the same process? Since --cb_explore essentially learns a non-stationary policy, I think it should use a different one.
--cb_explore also supports --epsilon <epsilon> (the epsilon-greedy exploration algorithm): during learning, a portion of predictions is spent exploring the action space at random, as opposed to pure greedy exploitation of what is already known.

Note: vowpalwabbit.org is an excellent resource for further information on contextual bandits in vw.
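The epsilon-greedy scheme mentioned above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not VW's actual implementation; note that VW's cb modes minimize *cost* rather than maximize reward, so the greedy choice is the lowest-cost action:

```python
import random

def epsilon_greedy(estimated_costs, epsilon, rng=None):
    """Choose an action index under epsilon-greedy:
    with probability epsilon pick uniformly at random (explore),
    otherwise pick the action with the lowest estimated cost (exploit)."""
    rng = rng or random.Random(0)
    if rng.random() < epsilon:
        return rng.randrange(len(estimated_costs))          # explore
    return min(range(len(estimated_costs)),
               key=lambda a: estimated_costs[a])            # exploit (greedy)

def action_probabilities(estimated_costs, epsilon):
    """Probability of each action being chosen under epsilon-greedy:
    epsilon/k for every one of the k actions, plus (1 - epsilon)
    extra mass on the greedy (lowest-cost) action."""
    k = len(estimated_costs)
    best = min(range(k), key=lambda a: estimated_costs[a])
    probs = [epsilon / k] * k
    probs[best] += 1.0 - epsilon
    return probs
```

With epsilon = 0 this collapses to the deterministic greedy policy asked about in question 1; with epsilon > 0 every action keeps nonzero probability, which is what makes the logged probabilities usable for later off-policy learning.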
Vowpal Wabbit supports three (3) contextual bandit base algorithms:

--cb
The contextual bandit module, which lets you optimize a predictor based on already-collected data; in other words, contextual bandits without exploration.

--cb_explore
The contextual bandit learning algorithm for when the maximum number of actions is known ahead of time and the semantics of the actions stay the same across examples.

--cb_explore_adf
The contextual bandit learning algorithm for when the set of actions changes over time or you have rich information for each action.

Vowpal Wabbit offers different input formats for contextual bandits. When exploration is in effect, Vowpal Wabbit offers five (5) exploration algorithms:
--first
--epsilon
--bag
--cover
--softmax
(only supported for --cb_explore_adf)

Working examples for every option can be found in the source tree in the file tests/RunTests; scroll down to the __DATA__ section to find many command examples.
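To make the --cb input format concrete: each example carries a label of the form action:cost:probability (the probability being that with which the logging policy chose the action when the data was collected), followed by the features. Here is a minimal sketch; the helper name cb_example is hypothetical, only the emitted line format follows VW's documented cb format:

```python
def cb_example(action, cost, probability, features):
    # One line of VW's --cb training format:
    #   <action>:<cost>:<probability> | <feature> <feature> ...
    # <probability> is the probability with which the logging policy
    # chose <action> when this data point was collected.
    return f"{action}:{cost}:{probability} | " + " ".join(features)

line = cb_example(2, 0.5, 0.25, ["user_age=25", "hour=14"])
# line == "2:0.5:0.25 | user_age=25 hour=14"
```

A file of such lines could then be trained on (assuming vw is installed and, say, 4 possible actions) with something like `vw --cb 4 -d train.dat`.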