python, numpy, numerical-methods

Why does the pseudo-inverse blow up in numpy?


Consider the following code. Why does the norm of pinv of AA.T blow up? Is there a more numerically stable way of computing it?

# Why does the norm of pinv(AA.T) blow up?

import numpy as np
import matplotlib.pyplot as plt

# maximum number of equations
nn_max = 225

# norm of pinv of AA and AA.T
AA_pinv_norm  = []
AAT_pinv_norm = []

for nn in range(1, nn_max):
    AA = np.asarray([[0.1,0.1]]*nn)
    
    pinv_AA = np.linalg.pinv(AA)
    AA_pinv_norm.append(np.linalg.norm(pinv_AA))  
    
    pinv_AAT = np.linalg.pinv(AA.T)
    AAT_pinv_norm.append(np.linalg.norm(pinv_AAT))  
  
    
fig , ax1 = plt.subplots(nrows=1,ncols=1)
ax1.plot(AA_pinv_norm,'go-',markerfacecolor='none',label='norm(pinv(AA))')
ax1.plot(AAT_pinv_norm,'rx-',label='norm(pinv(AA.T))')
ax1.ticklabel_format(useOffset=False,style='plain')    # to prevent scientific notation.
ax1.set_title('Blow up of norm(pinv(AA.T))')
ax1.set_yscale('log')
ax1.set_xlabel('size')
ax1.set_ylabel('norm of pinv')
ax1.legend()
plt.show()

Solution

  • What happens

    You are trying to compute the Moore-Penrose pseudo-inverse of a matrix whose rank is 1 but whose dimensions are much larger than 1.

    That is legal, sure. That is even what pseudo-inverses are for.

    But at some point the computation will involve many small values that should be exactly 0 but that numerically are not.

    Let's do it by hand

    import numpy as np
    
    AA=np.asarray([[0.1,0.1]]*100)
    
    u,d,vh=np.linalg.svd(AA, full_matrices=False)
    
    # d holds the singular values
    # And AA is (u*d)@vh
    print((u*d)@vh)
    

    Gives

    # [[0.1, 0.1],
    #  [0.1, 0.1],
    #  ... (97 lines)
    #  [0.1, 0.1]]
    

    The pseudo-inverse is then

    InvAA = (vh.T*(1/d)) @ u.T
    

    Well, almost. Because d contains some "pseudo 0" values, we should use the pseudo-inverse of d in this computation. The pseudo-inverse of a diagonal matrix is easy: it is the diagonal matrix whose diagonal holds the inverse of each non-zero diagonal element, and 0 where the element is 0.

    Here, d is 1-D and holds just the diagonal values. So invd would be [1/x if np.abs(x)>0 else 0 for x in d]

    But since this is numerical computation, 0 is an idealized threshold. We need a more realistic one. Let's say 1e-15.

    In our case, that threshold is just fine: we know the true rank is 1, and the svd computation gives d=[1.41421356e+00, 1.79973556e-16]

    (So it acted as if the rank were 2, with a very small second singular value.)

    invd=[1/x if np.abs(x)>1e-15 else 0 for x in d]
    InvAA = (vh.T * invd) @ u.T
    print(InvAA)
    

    You can check that it satisfies the Moore-Penrose conditions:
    InvAA@AA@InvAA equals InvAA
    AA@InvAA@AA equals AA
    AA@InvAA and InvAA@AA are both symmetric
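
    For instance, a quick numerical check of those conditions (added here purely as an illustration):

    print(np.allclose(InvAA@AA@InvAA, InvAA))    # True
    print(np.allclose(AA@InvAA@AA, AA))          # True
    print(np.allclose(AA@InvAA, (AA@InvAA).T))   # True, AA@InvAA is symmetric
    print(np.allclose(InvAA@AA, (InvAA@AA).T))   # True, InvAA@AA is symmetric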

    Let's do the same from AA.T

    u,d,vh=np.linalg.svd(AA.T, full_matrices=False)
    invd=[1/x if np.abs(x)>1e-15 else 0 for x in d]
    InvAAT = (vh.T * invd) @ u.T
    print(AA.T@[email protected])
    

    This time, it doesn't work. And if we look at the values of InvAAT, what happens is obvious: huge values. Note that it almost works, though: AA.T@[email protected] is not that far from AA.T, compared to the huge values the computation is made of. So it is clearly a numerical precision problem.
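
    You can quantify that (an illustrative check, not part of the original computation):

    print(np.abs(InvAAT).max())                      # huge entries
    print(np.abs(AA.T@[email protected] - AA.T).max())      # small only relative to those huge entries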

    If we look at d we can immediately see why

    d is [1.41421356e+00, 1.95277784e-15]

    So our 1e-15 threshold is not sufficient here.

    What if we just raise it to 1e-14?

    u,d,vh=np.linalg.svd(AA.T, full_matrices=False)
    invd=[1/x if np.abs(x)>1e-14 else 0 for x in d]
    InvAAT = (vh.T * invd) @ u.T
    print(AA.T@[email protected])
    

    It works again!

    Of course, this comes with the risk of losing information, if one of those cut "almost 0" singular values were not really 0. But in our case there is no problem (anyway, we know the real rank is 1, so we could in fact choose to keep just the first value of d and discard the rest).

  • Why it happens

    Now, why does this happen more with AA.T than with AA; why is the numerical error bigger for AA.T than for AA? I can't really say, since I don't know the specifics of numpy's svd algorithm.

    But if you look at the many equivalent ways to compute the Moore-Penrose pseudo-inverse, or the singular value decomposition, you'll see that they involve computing eigenvalues of M.T@M. For M=AA that is AA.T@AA, a 2×2 matrix, while for M=AA.T it is [email protected], an nn×nn matrix. So, unsurprisingly, the latter involves more noise.
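
    Purely as an illustration (not necessarily what numpy does internally), you can compare the two eigenvalue problems:

    small = np.linalg.eigvalsh(AA.T@AA)   # 2×2: one eigenvalue ~2, one "almost 0"
    big   = np.linalg.eigvalsh([email protected])   # 100×100: one ~2, 99 "almost 0"
    print(small)
    print(np.sum(np.abs(big) < 1e-12))    # typically 99 numerically noisy "zeros"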

    Another method is to compute the limit of (M.T@M + εI)⁻¹@M.T as ε → 0. In the case of AA, this likewise involves inverting a 2×2 matrix of almost rank 1, where a single ε value pushes the rank to 2. In the case of AA.T, it involves a 100×100 matrix (for nn=100 as in my example) of almost rank 1, where 99 ε values are needed to push the rank to 100. Obviously, you will have a harder time making it converge, and the numerical error when ε is really small will be bigger in the latter case than in the former.

    You can play with this method

    np.linalg.inv(AA.T@AA + 1e-13*np.identity(2))@AA.T
    

    This seems to give a reasonable value for the pseudo-inverse. While

    np.linalg.inv(AA.T@AA + 1e-14*np.identity(2))@AA.T
    

    is worse. (So if you make ε smaller and smaller, the approximation gets better and better until 1e-13, then worse and worse, for the usual reason: math says "the smaller the better", while numerical error says "the smaller the worse". Apparently, numerical error takes the lead around 1e-14.)
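
    A quick sketch to see that turning point (the exact value where it flips may differ on your machine):

    for eps in [1e-10, 1e-11, 1e-12, 1e-13, 1e-14, 1e-15, 1e-16]:
        approx = np.linalg.inv(AA.T@AA + eps*np.identity(2))@AA.T
        print(eps, np.abs(AA@approx@AA - AA).max())   # reconstruction error for each ε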

    If you do the same for the AA.T case

    np.linalg.inv([email protected] + ε*np.identity(100))@AA
    

    This time, it is around 1e-10 that numerical error starts to worsen the results.

    So with this method too, you can see that it is harder, from a numerical point of view, to compute the pseudo-inverse of AA.T than the pseudo-inverse of AA.

    Neither of the two methods I've shown is likely the one numpy actually uses. But the same kind of consideration probably applies: the numerical error is simply bigger in the case of AA.T.

  • How to prevent it

    So, long story short: the threshold below which singular values must be treated as 0 needs to be bigger in the second case. And with pinv, that is what the rcond parameter is for.

    Just add rcond=1e-10 to your pinv calls, and you get this:

    (plot: with rcond=1e-10, norm(pinv(AA.T)) no longer blows up)
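
    Concretely, in the loop from the question, that means:

    pinv_AA  = np.linalg.pinv(AA, rcond=1e-10)     # singular values <= 1e-10 * largest are treated as 0
    pinv_AAT = np.linalg.pinv(AA.T, rcond=1e-10)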

    But of course, this comes with the usual tradeoff: a too big rcond risks ignoring significant values (small, but really not 0, not just numerical error). So you risk ending up with a pseudo-inverse of a rank even smaller than the rank of your matrix. Of course, not in your case, since your matrix is rank 1, and except for the null matrix there isn't any smaller rank anyway. As I said, with the svd method you could even just drop all the d values but the first one: (vh[:1].T/d[0]) @ u[:, :1].T. But that is only because you know that your matrix really has rank 1 (and that numpy's svd sorts singular values from the biggest to the smallest).
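
    For example, with the svd from above (a sketch that relies on the rank really being 1):

    u, d, vh = np.linalg.svd(AA, full_matrices=False)
    rank1_pinv = (vh[:1].T/d[0]) @ u[:, :1].T             # keep only the largest singular value
    print(np.allclose(rank1_pinv, np.linalg.pinv(AA)))    # True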