Tags: graphics, perspective, depth-buffer

Perspective projection bug (not understanding theory)


I'm implementing my own renderer based on rasterization and depth buffering on the CPU. As you can see in the next image, it works!

Sample image of a rendered box of dimensions 1000x1000x10

However, there's something terribly wrong! Even though the box looks like a cube, its dimensions are 1000x1000x10. The "foreshortening" is far too strong. If I change the dimensions to 1000x1000x1000, the box stretches off to infinity:

Sample image of a rendered box of dimensions 1000x1000x1000

This happens because there's something I'm missing in the perspective projection. When I do the perspective projection (3D world to 2D screen), I first apply the view transform (placing the coordinate system at the camera position, with the proper orientation). To simplify things, my camera has the same orientation as the world; the only thing that changes is the position, which is at (0, 0, -1):

  // Express the world-space point in camera space: rotate into the camera
  // basis, then subtract the camera position.
  const Point3D point_camera = {
     point_world.x * m_left.x + point_world.y * m_up.x + point_world.z * m_forward.x - m_position.x,
     point_world.x * m_left.y + point_world.y * m_up.y + point_world.z * m_forward.y - m_position.y,
     point_world.x * m_left.z + point_world.y * m_up.z + point_world.z * m_forward.z - m_position.z
  };

And then I apply the perspective division, dividing the x and y components of the camera-space point by its Z:

  // Scale x and y by the near-plane distance and the zoom factor, then divide
  // by the camera-space depth.
  const Point2D point_projected = {
     (point_camera.x * m_near * m_zoom) / point_camera.z,
     (point_camera.y * m_near * m_zoom) / point_camera.z
  };

I feel like I should be multiplying the Z by some kind of factor... but I can't figure it out. Maybe the w component has something to do with this? If someone could help me, or point me to a good explanation of the theory behind perspective projection, I would be very grateful.

All the code is on my GitHub. The relevant classes are PerspectiveCamera and ForwardRasterizer.


Solution

  • So if you want to transform a 3-dimensional vector or point using a 4x4 matrix, as you probably already know, you need to augment it with another component (frequently called w). For the sake of a proper transformation, w is often chosen to be 1, such that (x, y, z) becomes (x, y, z, 1).
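
    As a tiny sketch (the names Vec4 and to_homogeneous are just illustrative, not from the question's code), the augmentation is nothing more than carrying the extra component along:

        // A point/vector carrying the extra homogeneous component w.
        struct Vec4 { double x, y, z, w; };

        // Augment a 3D point with w = 1 so a 4x4 matrix can transform it.
        Vec4 to_homogeneous(double x, double y, double z) {
            return { x, y, z, 1.0 };
        }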

    Let's assume that your point has already been transformed into view space, so that it's 'in front of' the camera. The next, and last, transformation you want to perform is the actual projection. If you multiply the vector by this matrix (which is a bit different from Datenwolf's) you get a resulting column vector that looks similar to this:

    Vector multiplication against perspective matrix
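
    The image above isn't reproduced here, but a matrix consistent with the description that follows (a sketch only, with the field-of-view/aspect scaling of x and y omitted and z taken as a positive distance in front of the viewer) multiplies out like this:

        \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & \frac{f+n}{f-n} & -\frac{2fn}{f-n} \\ 0 & 0 & -1 & 0 \end{bmatrix}
        \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
        =
        \begin{bmatrix} x \\ y \\ \frac{f+n}{f-n}\,z - \frac{2fn}{f-n} \\ -z \end{bmatrix}
        =
        \begin{bmatrix} x' \\ y' \\ z' \\ w' \end{bmatrix}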

    I should mention that in this vector I left out the scaling of x and y for the field of view and aspect ratio, just to make things a little clearer.

    You can see that the third element in the vector is effectively normalizing z, as you mentioned. Let's assume w is still 1, f (our far plane) is 100, and n (our near plane) is 10. With these assumptions, the third element becomes:

    Image: the third element evaluated with f = 100 and n = 10

    If you plug in a few different values of z in the range [10, 100], you will see that this is indeed just normalizing z.
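
    As a concrete check against the row in the sketch above (so, under the same sign convention), with f = 100 and n = 10 the third element is

        z' = \frac{f+n}{f-n}\,z - \frac{2fn}{f-n} = \frac{11}{9}\,z - \frac{200}{9}

    which gives z' = -10 at z = 10 and z' = 100 at z = 100; after the division by w' described next, those endpoints land exactly on the -1/+1 bounds of the canonical volume.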

    The magic of projection happens in an unseen step: a division of x', y', z' by w', which applies the actual perspective effect to x' and y' in screen space, but also renormalizes your z' such that its value looks something like this as the point moves away from the viewer:

    Image: graph of the renormalized depth value as the point moves away from the viewer

    As a side note, the shape of the graph is intentional. It's to take advantage of the precision limitations of the depth buffer, but that's a whole other discussion.
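
    To tie the steps together, here is a minimal C++ sketch of that projection path (the names Mat4, perspective, multiply and perspective_divide are made up for illustration and are not taken from the linked repository; this version keeps the field-of-view and aspect scaling that the sketch above omitted):

        #include <cmath>

        struct Vec4 { double x, y, z, w; };   // homogeneous point, as sketched earlier
        struct Mat4 { double m[4][4]; };

        // Perspective matrix with the layout sketched above. fov_y is the vertical
        // field of view, n and f are the near/far plane distances, and camera-space
        // z is treated as a positive distance in front of the viewer.
        Mat4 perspective(double fov_y, double aspect, double n, double f) {
            const double s = 1.0 / std::tan(fov_y * 0.5);
            return {{
                { s / aspect, 0.0,  0.0,               0.0                    },
                { 0.0,        s,    0.0,               0.0                    },
                { 0.0,        0.0,  (f + n) / (f - n), -2.0 * f * n / (f - n) },
                { 0.0,        0.0, -1.0,               0.0                    }
            }};
        }

        // Standard 4x4 matrix times column vector.
        Vec4 multiply(const Mat4& mat, const Vec4& v) {
            const auto& m = mat.m;
            return {
                m[0][0] * v.x + m[0][1] * v.y + m[0][2] * v.z + m[0][3] * v.w,
                m[1][0] * v.x + m[1][1] * v.y + m[1][2] * v.z + m[1][3] * v.w,
                m[2][0] * v.x + m[2][1] * v.y + m[2][2] * v.z + m[2][3] * v.w,
                m[3][0] * v.x + m[3][1] * v.y + m[3][2] * v.z + m[3][3] * v.w
            };
        }

        // The "unseen step": dividing by w' turns clip coordinates into normalized
        // device coordinates and produces the actual perspective foreshortening.
        Vec4 perspective_divide(const Vec4& clip) {
            return { clip.x / clip.w, clip.y / clip.w, clip.z / clip.w, 1.0 };
        }

    Before the division the coordinates are usually called clip coordinates; after it, normalized device coordinates.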

    The last thing I would like to note is the negative value of w' in the column vector. You might be thinking to yourself, "Why on earth would they do that? Wouldn't that flip everything vertically and horizontally?" And you are right, that's exactly what it does. The reason for it is something I've already mentioned a few times: the canonical view volume, which for OpenGL has screen-space mappings that look like this:

    Image: the OpenGL canonical view volume and its screen-space mapping

    Unlike other screen coordinate systems, the origin is in the dead center of the screen, no matter the screen's dimensions. In addition to that quirk, the most positive coordinate is in the upper right of the screen, rather than the lower right. Consequently, in many cases your projected coordinates must then be flipped, hence the -z as the value for w'.
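
    And as a small follow-on sketch (the names here are again just illustrative), the usual mapping from that canonical volume to conventional pixel coordinates, with the origin in the top-left corner and y growing downwards, looks like this:

        struct Pixel { double x, y; };

        // Map normalized device coordinates (x, y in [-1, 1], origin at the screen
        // center, +y up) to pixel coordinates (origin at the top-left, +y down).
        Pixel ndc_to_pixel(double ndc_x, double ndc_y, int width, int height) {
            return {
                (ndc_x + 1.0) * 0.5 * width,
                (1.0 - ndc_y) * 0.5 * height   // flip y for the top-left origin
            };
        }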

    I hope this helps you out!