I'm trying to learn how to zoom towards the mouse cursor using an orthographic projection, and so far I've got this:
```python
def dolly(self, wheel, direction, x, y, acceleration_enabled):
    v = vec4(*[float(v) for v in glGetIntegerv(GL_VIEWPORT)])
    w, h = v[2], v[3]
    f = self.update_zoom(direction, acceleration_enabled)  # [0.1, 4]
    aspect = w / h
    x, y = x - w / 2, y - h / 2
    K1 = f * 10
    K0 = K1 * aspect
    self.left = K0 * (-2 * x / w - 1)
    self.right = K0 * (-2 * x / w + 1)
    self.bottom = K1 * (2 * y / h - 1)
    self.top = K1 * (2 * y / h + 1)
```
The results I'm getting are really strange, but I don't know which part of the formulas I've messed up.
Could you please spot which part of my maths is wrong, or just post clear pseudocode I can try? Just for the record, I've read & tested quite a lot of versions out there on the internet, but I haven't yet found any place where this subject is explained properly.
PS: You don't need to post any SO link related to this subject, as I've read all of them already :)
I'm going to answer this in a general way, based on the following set of assumptions:

1. You use a matrix P for the (ortho) projection describing the actual mapping of your eye space view volume onto the standard view volume [-1,1]^3 OpenGL will clip against (see also assumption 2), and a matrix V for the view transformation, that is, the position and orientation of the "camera" (if there is such a thing, especially in ortho projections), basically establishing the eye space your view volume is defined relative to.
2. The projection is a plain orthographic one, so the clip space w coordinate stays 1 and no unusual operations on the w coordinate are applied; clip space and NDC are therefore identical.
3. Eye space follows the usual GL convention: the viewing direction is -z.
4. The viewport transformation is the standard one, with the viewport covering the whole window of size width x height.
5. Mouse coordinates (x, y) are given in a window coordinate system with the origin at the top-left corner and y pointing down.
6. Integer mouse coordinates refer to the centers of the pixels they address.
7. P is symmetrical: right = -left and top = -bottom, and it is also supposed to stay symmetrical after the zoom operation; therefore, to compensate for any movement, the view matrix V must be adjusted, too.

What you want to get is a zoom such that the object point under the mouse cursor does not move, so it becomes the center of the scale operation. The mouse cursor itself is only 2D, and a whole straight line in 3D space will be mapped to the same pixel location. However, in an ortho projection, that line will be orthogonal to the image plane, so we don't need to bother much with the third dimension.
So what we want is to scale the current situation with P_old (defined by the ortho parameters l_old, r_old, b_old, t_old, n_old and f_old) and V_old (defined by the "camera" position c_old and orientation o_old) by a zoom factor s at the mouse position (x, y) (in the space from assumption 6).
We can see a few things directly:

- n_new = n_old and f_new = f_old: the near and far planes are not affected by the zoom.
- o_new = o_old: zooming does not change the camera orientation.
- When zooming in by a factor s, the actual view volume must be scaled by 1/s, since when we zoom in, a smaller part of the complete world is mapped onto the screen than before (and appears bigger). So we can simply scale the frustum parameters we had:
  l_new = l_old / s, r_new = r_old / s, b_new = b_old / s, t_new = t_old / s
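As a tiny sketch of this parameter update (the concrete numbers are made up for illustration):

```python
# Zooming in by factor s shrinks the symmetric view volume by 1/s.
s = 2.0                                            # zoom factor (> 1 zooms in)
l_old, r_old, b_old, t_old = -4.0, 4.0, -3.0, 3.0  # current ortho frustum
l_new, r_new, b_new, t_new = l_old / s, r_old / s, b_old / s, t_old / s
```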
If we now only replace P_old by P_new, we get the zoom, but the world point under the mouse cursor will move (unless the mouse is exactly in the center of the view). So we have to compensate for that by modifying the camera position.
Let's first put the mouse coords (x,y)
into OpenGL window space (assumptions 5 and 6):
x_win = x + 0.5
y_win = height - 0.5 - y
Note that besides mirroring y, I also shift the coordinates by half a pixel. That's because in OpenGL window space, pixel centers lie at half-integer coordinates, while I assume that your integer mouse coordinates represent the center of the pixel you click on (it will not make a big difference visually, but still).
Now let's further put the coords into Normalized Device Coordinates (relying on assumption 4 here):
x_ndc = 2.0 * x_win / width - 1
y_ndc = 2.0 * y_win / height - 1
By assumption 2, clip and NDC coordinates will be identical, and we can call the vector v
our NDC-space mouse coordinates: v = (x_ndc, y_ndc, 0, 1)^T
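The two conversion steps can be wrapped into a small helper; this is just a sketch of the formulas above (the function name is mine):

```python
def mouse_to_ndc(x, y, width, height):
    """Map integer mouse coords (top-left origin, y down) to OpenGL NDC.

    Assumes the viewport covers the whole window (assumption 4) and that
    integer mouse coordinates address pixel centers (assumption 6).
    """
    x_win = x + 0.5               # pixel centers sit at half-integer coords
    y_win = height - 0.5 - y      # mirror y: GL window space has y pointing up
    x_ndc = 2.0 * x_win / width - 1.0
    y_ndc = 2.0 * y_win / height - 1.0
    return x_ndc, y_ndc
```

For an 800x600 window, clicking the top-left pixel (0, 0) yields coordinates just inside the (-1, +1) corner of NDC.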
We can now state our "point under mouse must not move" condition:
inverse(V_old) * inverse(P_old) * v = inverse(V_new) * inverse(P_new) * v
But let's just go into eye space and look at what happened. Let a = inverse(P_old) * v be the eye space location of the point under the mouse cursor before we scaled, and let b = inverse(P_new) * v be the eye space location of the point under the mouse cursor after we scaled. Since we assumed a symmetrical view volume, we already know that for the x and y coordinates, b = (1/s) * a holds (assumption 7; if that assumption does not hold, you need to do the actual calculation for b too, which isn't hard either).
So we can set up a 2D eye space offset vector d which describes how our point of interest was moved by the scale:

d = b - a = (1/s) * a - a = a * (1/s - 1)
To compensate for that movement, we have to move our camera inversely, so by -d.
If you keep the camera position as a separate variable, as I did in assumption 1, you simply need to update the camera position c accordingly. You just have to take care of the fact that c is a world space position, while d is an eye space offset:

c_new = c_old - inverse(V_old) * (d_x, d_y, 0, 0)^T
Note that if you do not keep the camera position as a separate variable, but keep the view matrix directly, you can simply pre-multiply the translation: V_new = translate(-d_x, -d_y, 0) * V_old
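The compensation step can be sketched with numpy (the helper name is mine, and I assume you track the camera position c as a plain 3-vector alongside the view matrix):

```python
import numpy as np

def compensate_camera(c_old, V_old, a_xy, s):
    """Move the camera so the point under the cursor stays fixed.

    c_old: world space camera position (3 components),
    V_old: current 4x4 view matrix,
    a_xy:  eye space (x, y) of the point under the cursor before zooming,
    s:     zoom factor.
    """
    # eye space drift of the point of interest: d = a * (1/s - 1)
    d = np.array([a_xy[0], a_xy[1], 0.0, 0.0]) * (1.0 / s - 1.0)
    # d is an eye space offset, so bring it back to world space before moving c
    return np.asarray(c_old, dtype=float) - (np.linalg.inv(V_old) @ d)[:3]
```

As the update further down explains, computing the offset directly in world space is numerically more robust when zooming far out.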
Update
What I wrote so far is correct, but I took a shortcut that is numerically a very bad idea when working with finite-precision data types: the error in the camera position accumulates very fast if one zooms out a lot, which became clearly visible once @BPL implemented this.
The main issue seems to be that I directly calculated the offset vector d in eye space, which does not take the current view matrix V_old (and its small errors) into account. So a more stable approach is to calculate all of this directly in world space:
a = inverse(P_old * V_old) * v
b = inverse(P_new * V_old) * v
d = b - a
c_new = c_old - d
(Doing so makes assumption 7 unnecessary as a by-product, so it directly works in the general case of arbitrary ortho matrices.)
Using this approach, the zoom operation worked as expected.
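The whole world-space recipe can be sketched end to end with numpy. Everything beyond the four lines above (the ortho/translate helpers, the function name, and the assumption that the view matrix is a pure translation by -c) is my own scaffolding:

```python
import numpy as np

def ortho(l, r, b, t, n, f):
    """Standard OpenGL orthographic projection matrix."""
    return np.array([
        [2.0 / (r - l), 0.0, 0.0, -(r + l) / (r - l)],
        [0.0, 2.0 / (t - b), 0.0, -(t + b) / (t - b)],
        [0.0, 0.0, -2.0 / (f - n), -(f + n) / (f - n)],
        [0.0, 0.0, 0.0, 1.0],
    ])

def translate(tx, ty, tz):
    """4x4 translation matrix."""
    m = np.eye(4)
    m[:3, 3] = [tx, ty, tz]
    return m

def zoom_at_mouse(c_old, left, right, bottom, top, near, far,
                  s, x, y, width, height):
    """Zoom by factor s keeping the world point under the cursor fixed.

    Returns the new camera position and the scaled frustum parameters.
    The view matrix is assumed to be a pure translation by -c_old.
    """
    V_old = translate(-c_old[0], -c_old[1], -c_old[2])
    P_old = ortho(left, right, bottom, top, near, far)
    new_frustum = tuple(p / s for p in (left, right, bottom, top))
    P_new = ortho(*new_frustum, near, far)

    # mouse -> NDC (assumptions 4-6)
    x_win, y_win = x + 0.5, height - 0.5 - y
    v = np.array([2.0 * x_win / width - 1.0,
                  2.0 * y_win / height - 1.0, 0.0, 1.0])

    # world space locations of the point under the cursor, before/after scaling
    a = np.linalg.inv(P_old @ V_old) @ v
    b = np.linalg.inv(P_new @ V_old) @ v
    d = b - a
    c_new = np.asarray(c_old, dtype=float) - d[:3]
    return c_new, new_frustum
```

A quick way to sanity-check any implementation of this: project the world point that was under the cursor with the new matrices and verify it lands on the same NDC coordinates as before.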