Fastest implementation to convert a matrix to LAPACK

I have a function that converts a matrix to what LAPACK routines in SciPy requires. I have written short code to illustrate what I do:

import numpy as np
import time

N = 3

def convert_matrix_lapack(Amat,out):
    for ii in range(N):
        out[(2*N-2-ii):(3*N-2-ii), ii] = Amat[:, ii]

A = np.arange(N**2).reshape(N, N)
A_lapack = np.zeros((3*N-2,N), dtype=np.float)

tt = time.time()
convert_matrix_lapack(A, A_lapack)
print(time.time() - tt)

In practice the generic matrix (with N=3),

>>> A
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

becomes

>>> A_lapack
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 2.],
       [0., 1., 5.],
       [0., 4., 8.],
       [3., 7., 0.],
       [6., 0., 0.]])

How can I use built-in functions in NumPy to fasten the code for any N (N less than 50 is my target)?

Solution

This should work:

from scipy.sparse import diags
import numpy as np

N = 3
A = np.arange(N**2).reshape(N,N)


offsets = np.arange(-N+1, 1)
A_lapack = np.flipud(diags(A, offsets, shape=(3*N-2, N)).toarray())

How it works:

It creates a sparse diagonal matrix using scipy.sparse.diags
Flips it upside down using numpy.flipud

If we include the array allocation step using np.zeros, the speed is comparable, but slower for an array of N=100:

0.00021839141845703125 seconds (original code)
0.0017552375793457031 seconds (my solution)