python, recommendation-engine, reinforcement-learning, openai-gym

Where does the RecoGym dataset come from?


I'm trying to build a taxonomy of reinforcement learning algorithms for an online shopping system (for which I have the data).

For this I have decided to use RecoGym, but I can't find a way to feed my own data into it. Is the data purely synthetic? Is there a way for the reinforcement learning algorithm to learn based only on the historical data I have?

I'm including the RecoGym usage code below in case it helps.

import gym, reco_gym

# env_1_args is a dictionary of default parameters (e.g. number of products)
from reco_gym import env_1_args

# you can overwrite environment arguments here:
env_1_args['random_seed'] = 42

# initialize the gym for the first time by calling .make() and .init_gym()
env = gym.make('reco-gym-v1')
env.init_gym(env_1_args)

# .reset() env before each episode (one episode per user)
env.reset()
done = False

# counting how many steps
i = 0 

while not done:
    action, observation, reward, done, info = env.step_offline()
    print(f"Step: {i} - Action: {action} - Observation: {observation} - Reward: {reward}")
    i += 1

# instantiate an instance of the PopularityAgent class
# (PopularityAgent is not part of the reco_gym package itself; it comes from
#  RecoGym's "Getting Started" notebook; see the sketch after this code)
num_products = 10
agent = PopularityAgent(num_products)

# resets random seed back to 42, or whatever we set it to in env_1_args
env.reset_random_seed()

# train on 1000 users offline
num_offline_users = 1000

for _ in range(num_offline_users):

    # reset env and initialize observation, so the first
    # old_observation is None rather than a stale value from a previous user
    env.reset()
    observation, reward, done = None, 0, False

    while not done:
        old_observation = observation
        action, observation, reward, done, info = env.step_offline()
        agent.train(old_observation, action, reward, done)

# train on 100 users online and track click through rate
num_online_users = 100
num_clicks, num_events = 0, 0

for _ in range(num_online_users):

    # reset env and take a first empty step to get an observation;
    # keep the done flag from that step rather than overwriting it
    env.reset()
    observation, _, done, _ = env.step(None)
    reward = None

    while not done:
        action = agent.act(observation, reward, done)
        observation, reward, done, info = env.step(action)

        # used for calculating click-through rate
        # (reward == 1 already excludes None, so no extra check is needed)
        num_clicks += 1 if reward == 1 else 0
        num_events += 1

ctr = num_clicks / num_events

print(f"Click Through Rate: {ctr:.4f}")

The paper describing the environment is here: https://arxiv.org/pdf/1808.00720.pdf


Solution

  • The data is purely simulated. We think the simulation is reasonable, but that is ultimately a judgement call. With real-world data you will only have a log of past actions and how well they performed, which makes it difficult to evaluate algorithms that would take different actions. You may be able to use the inverse propensity score (IPS), but it will often be unacceptably noisy for many important applications (a sketch of an IPS estimator follows below).

    The role of RecoGym is to help you evaluate algorithms using a simulated A/B test. It includes a few agents that you may try (and more are being added), but it isn't aimed at producing an out-of-the-box solution to your problem; rather, it is a sandbox to help you test and evaluate algorithms, as sketched below.
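
To make the IPS point concrete, here is a minimal sketch of a plain IPS estimator of the reward a new policy would have earned on logged data. The log format (action, reward, logging propensity) and the target_prob function are illustrative assumptions, not part of RecoGym.

import numpy as np

def ips_estimate(log, target_prob):
    # log         : iterable of (action, reward, logging_propensity) tuples
    #               from the policy that generated the historical data
    # target_prob : function mapping an action to its probability under
    #               the policy being evaluated
    # (both formats are assumptions for this sketch)
    weighted_rewards = [
        (target_prob(action) / propensity) * reward
        for action, reward, propensity in log
    ]
    # the importance weights blow up when the target policy picks actions
    # the logging policy rarely took; this is the source of the noise
    # mentioned above
    return np.mean(weighted_rewards)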
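
As a concrete picture of the simulated A/B test workflow, the online loop from the question can be wrapped in a small helper and run once per candidate agent (after training each agent offline as above), resetting the random seed so every agent faces the same simulated users. The helper below is a sketch built from the question's own loop, not a RecoGym API.

def evaluate_ctr(env, agent, num_users):
    # one arm of a simulated A/B test: serve the agent to
    # num_users users and measure click-through rate
    num_clicks, num_events = 0, 0
    for _ in range(num_users):
        env.reset()
        observation, _, done, _ = env.step(None)
        reward = None
        while not done:
            action = agent.act(observation, reward, done)
            observation, reward, done, info = env.step(action)
            num_clicks += 1 if reward == 1 else 0
            num_events += 1
    return num_clicks / num_events

env.reset_random_seed()  # same simulated users for every agent compared
print(f"PopularityAgent CTR: {evaluate_ctr(env, agent, 100):.4f}")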