Tags: c#, unity-game-engine, ml-agent

ML agent not learning a relatively 'simple' task


I've tried to create a simple ML agent (ball) to learn to move towards and collide with a target.

Unfortunately, the agent doesn't appear to be learning: it simply wanders to what appear to be random positions the whole time. After 5M steps, the average reward remains at -1.

Any suggestions on what I'm doing wrong?

[TensorBoard cumulative reward graph]

My observations are here:

/// <summary>
/// Observations:
/// 1: Distance to nearest target
/// 3: Vector to nearest target
/// 3: Target Position
/// 3: Agent position
/// 1: Agent Velocity X
/// 1: Agent Velocity Y
/// 12 observations in total
/// </summary>
/// <param name="sensor"></param>
public override void CollectObservations(VectorSensor sensor)
{

    //If nearest Target is null, observe an empty array and return early
    if (target == null)
    {
        sensor.AddObservation(new float[12]);
        return;
    }

    float distanceToTarget = Vector3.Distance(target.transform.position, this.transform.position);

    //Distance to nearest target (1 observation)
    sensor.AddObservation(distanceToTarget);

    //Vector to nearest target (3 observations)
    Vector3 toTarget = target.transform.position - this.transform.position;

    sensor.AddObservation(toTarget.normalized);


    //Target position
    sensor.AddObservation(target.transform.localPosition);

    //Current Position
    sensor.AddObservation(this.transform.localPosition);

    //Agent Velocities
    sensor.AddObservation(rigidbody.velocity.x);
    sensor.AddObservation(rigidbody.velocity.y);
}

My YAML config file:

behaviors:
  PlayerAgent:
    trainer_type: ppo
    hyperparameters:
      batch_size: 512 #128
      buffer_size: 2048
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2 #0.2
      lambd: 0.99
      num_epoch: 3 #3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 32 #256
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 64
        learning_rate: 3.0e-4
    #keep_checkpoints: 5
    #checkpoint_interval: 500000
    max_steps: 5000000
    time_horizon: 64
    summary_freq: 10000
    threaded: true
    framework: tensorflow

[Unity Inspector component configuration]

Rewards (all in the agent script):

private void Update()
{

    //If the agent falls off the platform, give a negative reward and end the episode
    if (this.transform.position.y < 0)
    {
        AddReward(-1.0f);
        EndEpisode();
    }

    if(target != null)
    {
        Debug.DrawLine(this.transform.position, target.transform.position, Color.green);
    }

}

private void OnCollisionEnter(Collision collidedObj)
{
    //If agent collides with goal, provide reward
    if (collidedObj.gameObject.CompareTag("Goal"))
    {
        AddReward(1.0f);
        Destroy(target);
        EndEpisode();
    }
}

public override void OnActionReceived(float[] vectorAction)
{
    if (!target)
    {
        //Place and assign the target
        envController.PlaceTarget();
        target = envController.ProvideTarget();
    }

    Vector3 controlSignal = Vector3.zero;
    controlSignal.x = vectorAction[0];
    controlSignal.z = vectorAction[1];
    rigidbody.AddForce(controlSignal * moveSpeed, ForceMode.VelocityChange);

    // Apply a tiny negative reward every step to encourage the agent to reach the goal quickly
    if (this.MaxStep > 0) AddReward(-1f / this.MaxStep);

}

Solution

  • How hard would you say your environment is? If the target is rarely reached, the agent will not be able to learn from the goal reward alone. In that case, you need to add some intrinsic or shaping reward whenever the agent moves in the right direction; that lets it learn even when the main reward is sparse (see the sketch after this answer).

    There might also be a reward-hacking problem in the way the rewards are designed. If the agent cannot find the target and collect the larger reward, the most efficient strategy it can discover is to fall off the platform as quickly as possible, so it stops accumulating the small per-step penalty.
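
Here is a minimal sketch of what such a shaping reward could look like, reusing the fields from the question (target, rigidbody, moveSpeed, envController) and the older float[] action API; the previousDistance field and the 0.01f scale are illustrative assumptions rather than anything from the original code:

// Minimal distance-based reward shaping layered onto the original
// OnActionReceived. "previousDistance" is a new (hypothetical) field on the
// agent script; everything else comes from the question's code.
private float previousDistance;

public override void OnEpisodeBegin()
{
    // If the target does not exist yet, it is placed in OnActionReceived,
    // exactly as in the original code.
    previousDistance = target != null
        ? Vector3.Distance(target.transform.position, transform.position)
        : 0f;
}

public override void OnActionReceived(float[] vectorAction)
{
    if (!target)
    {
        //Place and assign the target
        envController.PlaceTarget();
        target = envController.ProvideTarget();
        previousDistance = Vector3.Distance(target.transform.position, transform.position);
    }

    Vector3 controlSignal = Vector3.zero;
    controlSignal.x = vectorAction[0];
    controlSignal.z = vectorAction[1];
    rigidbody.AddForce(controlSignal * moveSpeed, ForceMode.VelocityChange);

    // Dense shaping reward: small positive reward for closing the distance to
    // the target, small negative reward for moving away. The 0.01f scale is a
    // guess; it just needs to stay well below the +1 reward for reaching the goal.
    float currentDistance = Vector3.Distance(target.transform.position, transform.position);
    AddReward(0.01f * (previousDistance - currentDistance));
    previousDistance = currentDistance;

    // Keep the per-step time penalty small relative to the shaping reward,
    // otherwise "fall off the platform as fast as possible" can still look
    // like the cheapest policy.
    if (this.MaxStep > 0) AddReward(-1f / this.MaxStep);
}

With a dense signal like this, the cumulative-reward curve should start moving away from -1 well before the agent reliably reaches the goal, and falling off the platform stops being the most attractive behaviour.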