python reinforcement-learning vowpalwabbit

(vowpal wabbit) contextual bandit dealing with new context


For the last few days I have been trying to train a contextual bandit algorithm through Vowpal Wabbit, so I built a toy model to help me understand how the algorithm works.

I imagined a state with 4 possible actions and trained my model on two different contexts. Each context has only one optimal action among the 4 actions.

That's how I did it.

from vowpalwabbit import pyvw

# Each label is action:cost:probability; the text after the bar is the context.
vw = pyvw.vw("--cb_explore 4 -q UA --epsilon 0.1")
vw.learn('1:-2:0.5 | 5')
vw.learn('3:2:0.5 | 5')
vw.learn('1:2:0.5 | 15')
vw.learn('3:-2:0.5 | 15')
vw.learn('4:2:0.5 | 5')
vw.learn('4:2:0.5 | 15')
vw.learn('2:2:0.5 | 5')
vw.learn('2:2:0.5 | 15')

So in my example, for the context whose feature equals 5 the optimal action is 1 (the only one with a negative cost), and for the other context the optimal action is 3.

When I predict on those two contexts there is no problem, since the algorithm has already seen each of them once and received a reward that conditions its choice.

But when a new context arrives, I expect the algorithm to give me the most relevant action, for example by taking into account the similarity of the context features.

For example, if I give a feature equal to 29, I expect to get action 3, since 29 is closer to 15 than to 5.

That is my question for now.

Thanks !


Solution

  • The problem is in the way you've structured the feature. The input format for a feature is defined as name[:value], and if value is not supplied the default value is 1.0. So what you've supplied is a feature whose name is 5, or 15. Feature names are hashed and used to determine the index of the feature. So in your case feature 5 and feature 15 both have a value of 1.0 and are distinct features with different indices.
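
    To make the name[:value] rule concrete, here is a minimal Python sketch of how a feature token splits into a name and a value. It is a toy illustration, not VW's parser: the helper names parse_feature and parse_features are made up for this example, and namespaces and hashing are left out.

```python
# Toy illustration only: this is NOT VW's parser, just the name[:value] rule.
def parse_feature(token):
    """Split one feature token into (name, value); value defaults to 1.0."""
    name, sep, value = token.partition(":")
    return (name, float(value) if sep else 1.0)


def parse_features(text):
    """Parse the text after the bar into a list of (name, value) pairs."""
    return [parse_feature(tok) for tok in text.split()]


print(parse_features("5"))                   # [('5', 1.0)]
print(parse_features("15"))                  # [('15', 1.0)]
print(parse_features("my_feature_name:5"))   # [('my_feature_name', 5.0)]
print(parse_features("my_feature_name:15"))  # [('my_feature_name', 15.0)]
```

    The first two calls yield two unrelated features that both carry the value 1.0, while the last two yield one shared feature whose value changes with the context.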

    Therefore, to fix your problem you just need to give your features a name.

    vw.learn('1:-2:0.5 | my_feature_name:5')
    vw.learn('1:2:0.5 | my_feature_name:15')
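
    Why a named numeric feature helps with an unseen value such as 29 can be sketched in plain Python. The sketch assumes an idealized model that fits one straight line per action through the two observed (feature value, cost) points from the question. VW's online learner with exploration will not reproduce these numbers exactly, and the names observed, fit_line and best_action are hypothetical.

```python
# Idealized sketch: one line per action through the observed costs.
# This is not VW's learner; it only shows how a shared numeric feature
# lets a linear model say something about values it has never seen.
observed = {
    1: {5: -2.0, 15: 2.0},
    2: {5: 2.0, 15: 2.0},
    3: {5: 2.0, 15: -2.0},
    4: {5: 2.0, 15: 2.0},
}


def fit_line(points):
    """Return (slope, intercept) of the line through two (x, cost) points."""
    (x0, c0), (x1, c1) = sorted(points.items())
    slope = (c1 - c0) / (x1 - x0)
    return slope, c0 - slope * x0


def best_action(x):
    """Pick the action whose fitted line predicts the lowest cost at x."""
    predicted = {}
    for action, points in observed.items():
        slope, intercept = fit_line(points)
        predicted[action] = slope * x + intercept
    return min(predicted, key=predicted.get)


print(best_action(5))   # 1
print(best_action(15))  # 3
print(best_action(29))  # 3
```

    Note that a linear model extrapolates along the feature value rather than measuring similarity: here the cost of action 3 keeps falling as the value grows, which is why 29 maps to action 3.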
    

    You can read more about the input format in the Vowpal Wabbit input format documentation.

    Also, I'd like to point out that -q UA is not doing anything in your example, as you do not have namespaces. Namespaces can be specified by placing them next to the bar. The following example has two namespaces, A and B. (Note: if more than one character is used for namespace only the first character is used with -q)

    1:-2:0.5 |A my_feature_name:5 |B yet_another_feature:4
    

    In this case if we supplied -q AB, then VW would create a new feature for each pair of features in A and B at runtime. This allows you to express more complicated interactions in the representation VW learns.
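
    The pairing that -q AB performs can be sketched as follows. This is a rough illustration in plain Python, not VW's implementation: the function name quadratic_features and the "^" used to join the two names are assumptions made for readability. The crossed feature's value is taken to be the product of the two original values.

```python
from itertools import product


def quadratic_features(ns_a, ns_b):
    """Cross every feature in ns_a with every feature in ns_b.

    Each namespace is a dict of feature name -> value; the crossed
    feature's value is the product of the two values.
    """
    return {
        f"{name_a}^{name_b}": val_a * val_b
        for (name_a, val_a), (name_b, val_b) in product(ns_a.items(), ns_b.items())
    }


# The two namespaces from the example line above.
namespace_a = {"my_feature_name": 5.0}
namespace_b = {"yet_another_feature": 4.0}

print(quadratic_features(namespace_a, namespace_b))
# {'my_feature_name^yet_another_feature': 20.0}
```

    With more features per namespace the number of crossed features grows as the product of the two namespace sizes, so interactions are best limited to the namespaces that actually need them.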