I've tried to create a simple ML agent (a ball) that learns to move towards and collide with a target.
Unfortunately, the agent doesn't appear to be learning and simply moves around seemingly at random the whole time. After 5M steps, the average reward remains at -1.
Any suggestions on what I'm doing wrong?
[TensorFlow cumulative reward graph]
My observations are here:
/// <summary>
/// Observations:
/// 1: Distance to nearest target
/// 3: Vector to nearest target
/// 3: Target position
/// 3: Agent position
/// 1: Agent velocity X
/// 1: Agent velocity Y
/// = 12 observations in total
/// </summary>
/// <param name="sensor"></param>
public override void CollectObservations(VectorSensor sensor)
{
    // If the nearest target is null, observe an empty array and return early
    if (target == null)
    {
        sensor.AddObservation(new float[12]);
        return;
    }

    // Distance to nearest target (1 observation)
    float distanceToTarget = Vector3.Distance(target.transform.position, this.transform.position);
    sensor.AddObservation(distanceToTarget);

    // Vector to nearest target (3 observations)
    Vector3 toTarget = target.transform.position - this.transform.position;
    sensor.AddObservation(toTarget.normalized);

    // Target position (3 observations)
    sensor.AddObservation(target.transform.localPosition);

    // Current position (3 observations)
    sensor.AddObservation(this.transform.localPosition);

    // Agent velocities (2 observations)
    sensor.AddObservation(rigidbody.velocity.x);
    sensor.AddObservation(rigidbody.velocity.y);
}
My YAML File config:
behaviors:
  PlayerAgent:
    trainer_type: ppo
    hyperparameters:
      batch_size: 512 #128
      buffer_size: 2048
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2 #0.2
      lambd: 0.99
      num_epoch: 3 #3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 32 #256
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 64
        learning_rate: 3.0e-4
    #keep_checkpoints: 5
    #checkpoint_interval: 500000
    max_steps: 5000000
    time_horizon: 64
    summary_freq: 10000
    threaded: true
    framework: tensorflow
[Unity Inspector component config screenshot]
Rewards (All on the agent script):
private void Update()
{
    // If the agent falls off the screen, give a negative reward and end the episode
    if (this.transform.position.y < 0)
    {
        AddReward(-1.0f);
        EndEpisode();
    }

    if (target != null)
    {
        Debug.DrawLine(this.transform.position, target.transform.position, Color.green);
    }
}

private void OnCollisionEnter(Collision collidedObj)
{
    // If the agent collides with the goal, provide a reward
    if (collidedObj.gameObject.CompareTag("Goal"))
    {
        AddReward(1.0f);
        Destroy(target);
        EndEpisode();
    }
}
public override void OnActionReceived(float[] vectorAction)
{
    if (!target)
    {
        // Place and assign the target
        envController.PlaceTarget();
        target = envController.ProvideTarget();
    }

    Vector3 controlSignal = Vector3.zero;
    controlSignal.x = vectorAction[0];
    controlSignal.z = vectorAction[1];
    rigidbody.AddForce(controlSignal * moveSpeed, ForceMode.VelocityChange);

    // Apply a tiny negative reward every step to encourage action
    if (this.MaxStep > 0) AddReward(-1f / this.MaxStep);
}
How hard would you say your environment is? If the target is rarely reached, the agent will not be able to learn. In that case, you need to add a shaping reward that pays the agent a little whenever it moves in the right direction. That allows the agent to learn even when the extrinsic rewards are sparse.
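For example, a distance-based shaping term could be added right after the force is applied in OnActionReceived. This is only a minimal sketch that reuses your existing target field; previousDistance is a new field I'm introducing here, the 0.01f scale is an assumption you would need to tune, and if you already override OnEpisodeBegin you would just add the reset there:
private float previousDistance;  // new field, reset at the start of each episode

public override void OnEpisodeBegin()
{
    previousDistance = target != null
        ? Vector3.Distance(this.transform.position, target.transform.position)
        : 0f;
}

// Inside OnActionReceived, after rigidbody.AddForce(...):
if (target != null)
{
    float currentDistance = Vector3.Distance(this.transform.position, target.transform.position);
    // Small positive reward for getting closer, small negative for moving away
    AddReward(0.01f * (previousDistance - currentDistance));
    previousDistance = currentDistance;
}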
There might also be a reward-hacking problem in the way you have designed the rewards. If the agent cannot find the target to earn the larger reward, the most efficient strategy is to fall off the platform as quickly as possible so it stops paying the small per-step penalty; see the sketch below.
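One way to remove that exploit is to make falling off clearly worse than sitting through a full episode of per-step penalties (which only add up to about -1 over MaxStep steps), or to drop the per-step penalty once a shaping reward is in place. A sketch of the first option, where the -2.0f value is an arbitrary choice of mine:
// In Update(): falling off should be strictly worse than any "stay on the
// platform" behaviour, so the agent cannot escape the step penalty by jumping off
if (this.transform.position.y < 0)
{
    AddReward(-2.0f);
    EndEpisode();
}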