optimization neural-network reinforcement-learning

Reinforcement Learning - Optimizing Weights Given Scores

I am working on a project that has a simulated robot exploring an unknown, but patterned environment (such as an office building) by moving around to predefined "sensing locations". In other words, at each point the robot must choose a new location to move to from the available visible locations. Our ultimate goal is to have the robot learn how to exploit the patterns in the environment to optimize global exploration time.

The robot chooses which location to move to next by giving it a utility score based on a linear combination of a number of known features about the location (such as distance to the point, average distance from the point to all others, area around the point already explored, etc.). My goal is to optimize the weights of this utility function to give the fastest time to explore the whole environment.

Because the score depends on the entire exploration path, I do not want to alter the weights mid-exploration. To test a combination of weights, I want the simulated robot to run through the entire environment with those weights, and get the resulting score. Therefore, I can create an |w|+1 x n array of data, where |w| is the number of weights, such as the following:

w1    w2    w3     w4      score
0.23, 4.30, -0.33, -2.001, 17030
-1.3, 2.03, -10.1, -0.021, 21983
3.65, -1.1, 5.021, 0.2301, 19508
etc...

My question is, what sort of reinforcement learning algorithm would be best for this? Most of what I find in the literature and my research has to do with classification, and obviously multivariate regression wont work. I also tried implementing a q-learning algorithm, but this does not really work as there are a variable number of states and actions depending on the path taken and the structure of the environment. What I really want is some sort of structure that takes in row after row of the data, and determines the values of weights and their combinations that maximize the expected score. Any help/ideas? Thanks.

Solution

The way you formalize your setup (no intermediate rewards, no online learning, just a final score) is typical for black-box optimization (or phylogenetic reinforcement learning).

Among the appropriate algorithms are genetic algorithms, evolution strategies or stochastic search. Some state-of-the art algorithms are:

Covariance Matrix Adaptation (CMA-ES)
Differential Evolution (DE)
Natural Evolution Strategies (NES)

that each come in different flavors, depending on how many parameters you have, how noisy your score is and how many local optima you expect.

For a collection of implementations of these in Python, look at the PyBrain library.