
How to extract output policy from contextual bandit in vowpal wabbit?


I am running this contextual bandit example, using the same sample data as in the documentation:

1:2:0.4 | a c  
3:0.5:0.2 | b d  
4:1.2:0.5 | a b c  
2:1:0.3 | b c  
3:1.5:0.7 | a d  

with the command they suggest: vw -d train.dat --cb 4 --cb_type dr -f traindModel

and I wonder how to extract the learned policy from this command, and how to interpret it.

Then I run

vw -d train.dat --invert_hash traindModel

and receive the following output:

Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = ../r-mkosinski/train.dat
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
1.000000   1.000000          1      1.0     1.0000   0.0000        3
4.439352   7.878704          2      2.0     3.0000   0.1931        3
4.457758   4.476164          4      4.0     2.0000   1.4285        3

finished run
number of examples per pass = 5
passes used = 1
weighted example sum = 5
weighted label sum = 13
average loss = 4.14973
best constant = 2.6
total feature number = 16

How do I interpret those results? How do I extract the policy?

I also tried this type of command:

vw -d train.dat --cb 4 --cb_type dr  --invert_hash p2222.txt

and got the following result:

Version 7.8.0
Min label:0.000000
Max label:5.000000
bits:18
0 pairs: 
0 triples: 
lda:0
0 ngram: 
0 skip: 
options: --cb 4 --cb_type dr --csoaa 4
:0
 ^a:108232:0.263395
 ^a:108233:-0.028344
 ^a:108234:0.140435
 ^a:108235:0.215673
 ^a:108236:0.234253
 ^a:108238:0.203977
 ^a:108239:0.182416
 ^b:129036:-0.061075
 ^b:129037:0.242713
 ^b:129038:0.229821
 ^b:129039:0.206961
 ^b:129041:0.185534
 ^b:129042:0.137167
 ^b:129043:0.182416
 ^c:219516:0.264300
 ^c:219517:0.242713
 ^c:219518:-0.158527
 ^c:219519:0.206961
 ^c:219520:0.234253
 ^c:219521:0.185534
 ^c:219523:0.182416
 ^d:20940:-0.058402
 ^d:20941:-0.028344
 ^d:20942:0.372860
 ^d:20943:-0.056001
 ^d:20946:0.326036
Constant:202096:0.263742
Constant:202097:0.242226
Constant:202098:0.358272
Constant:202099:0.205581
Constant:202100:0.234253
Constant:202101:0.185534
Constant:202102:0.326036
Constant:202103:0.182416

Why are there only 5 records for d in the output, but 7 for c, b and a? Does this correspond to the fact that the features c, b and a occur in the data 3 times each, while d occurs only 2 times? There are also 8 Constant rows; what do they correspond to?


Solution

  • vw -d train.dat --invert_hash traindModel

    No contextual bandit is specified here, so vw does a simple linear regression.
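
    To actually extract and use the policy learned by the first command (rather than re-train a different model), a common approach is to load the saved model and let vw predict the chosen action for new contexts. A minimal sketch (test.dat and predictions.txt are made-up names; test.dat would contain unlabeled contexts such as "| a c"):

        vw -d train.dat --cb 4 --cb_type dr -f traindModel
        vw -i traindModel -t -d test.dat -p predictions.txt

    Here -t disables learning and -p writes the prediction for each input line, which for --cb should be the action (1-4) chosen by the learned policy.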

    How do I interpret those results?

    See https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial#vws-diagnostic-information

    There are also 8 Constant rows; what do they correspond to?

    Contextual bandit in VW is implemented using a reduction to (in this case) cost-sensitive one-against-all (csoaa) multiclass classification, and csoaa is in turn implemented as a reduction to linear regression. With --csoaa 4, each original feature is combined with every possible output label (action, in the contextual-bandit case), so instead of one original feature there are four features. Unfortunately, they all have the same name in the --invert_hash output, so you cannot tell which label corresponds to which weight, but they have different hashes, so you can see they are different features.
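
    To make the reduction concrete (this is a sketch of the idea, not the exact internal weight layout): for a context such as | a c there is one weight per (feature, action) pair, and the predicted cost of action k is roughly

        cost(k) = Constant_k + weight_k(a) + weight_k(c)

    The extracted policy is then simply: choose the action with the lowest predicted cost.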

    I think the contextual bandit reduction also multiplies the number of features, but I am not sure what the multiplication factor is for a given --cb_type. From the example we can see it is at least 2, because there are up to 8 weights with the same feature name, while --csoaa 4 accounts for only a factor of 4.
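
    A rough way to probe this (my own experiment, not part of the original answer) is to re-run training with a different --cb_type and count how many weights each feature gets in the readable model (p_ips.txt is just an assumed output file name):

        vw -d train.dat --cb 4 --cb_type ips --invert_hash p_ips.txt
        grep -cF '^a:' p_ips.txt

    If the count per feature differs between --cb_type dr and --cb_type ips, the extra factor comes from the contextual-bandit part of the reduction.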

    Why are there only 5 records for d in the output, and 7 for c, b and a?

    Features with zero weight are not stored in the model.

    Does this correspond to the fact that the features c, b and a occur in the data 3 times and d only 2 times?

    In a way yes, but not directly. As explained above, the entries in the --invert_hash output correspond to feature-label combinations (i.e. a combination of an original feature and an output label = action). If a given example is not predicted correctly (during online learning), the weight of the feature-correct_label combination is increased and the weight of the feature-predicted_label combination is decreased (this is the effect of the one-against-all reduction). So if a given feature-label combination never occurs in the training data, its weight is likely to remain zero (it is never increased nor decreased).
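
    A quick sanity check of the raw feature counts in train.dat (my own addition, not part of the original answer):

        grep -o '[abcd]' train.dat | sort | uniq -c

    This shows that a, b and c each occur in 3 examples, while d occurs in only 2.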