Tags: python, performance, scipy, sparse-matrix

How to improve the efficiency of my Python function involving sparse matrices?


I want to implement a particular function in Python involving sparse matrices and make it as fast as possible. Below I describe the problem in detail and how I have implemented it so far.

Problem: I have N=1000 fixed sparse matrices (dimensions and entries fixed), collectively called B, each of size 1000x1000 (the average sparsity, i.e. the number of non-zero entries divided by the total number of entries, is 0.0001). For a given vector u (of size 1000) I want to compute c[j] = u @ B[j] @ u for each j=0,...,999, and the output should be the numpy array c. So the sparse matrices B[j] (stored in the tuple B) are fixed, and u is my function input.

My implementation so far: I precompute all the matrices and treat them as global variables in my program. I decided to save them as scipy.sparse matrices in csr_matrix format (I read that this is the best format when one just wants to compute matrix-vector products) and store them in a tuple. To calculate my desired function I do

# precompute matrices, so treat them as global fixed variables
B = []
for j in range(1000):
    sparse_matrix = ...compute it... # sparse csr_matrix of size 1000x1000
    B.append(sparse_matrix)
B = tuple(B)

def func(u: np.ndarray) -> np.ndarray:
    """
    u is a np.array of length N=1000
    """
    return np.array([u.dot(B[k].dot(u.transpose())) for k in range(len(B))])

Question: Is this as efficient as it can get, or do you see room for improvement regarding speed, e.g. in the structure in which I save the matrices or in how I compute all the vector-matrix-vector products? Also, do you see potential for parallelising this computation? If so, do you have suggestions for which libraries/functions I should look into?

Thanks in advance!

Edit: By c[j] = u @ B[j] @ u I mean that I want to compute the matrix-vector product of B[j] and u and then the inner product of the result with u (mathematically, uᵀ B[j] u).
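
For a single sparse matrix, the operator form and the method form used in func compute the same scalar. A quick sanity check (b and u here are just placeholders built with scipy.sparse.random, not my actual data):

import numpy as np
from scipy.sparse import random as sparse_random

u = np.random.uniform(-1, 1, 1000)
b = sparse_random(1000, 1000, density=0.0001, format="csr")

c1 = u @ b @ u              # operator form
c2 = u.dot(b.dot(u))        # method form, as in func
print(np.isclose(c1, c2))  # True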

If it is of help to anyone, here is a small benchmark program where I create some random sparse matrices and evaluate the function on a random vector.

import numpy as np
from random import randint
from scipy.sparse import coo_matrix, csr_matrix
from time import time

# Create random tuple of sparse matrices
N = 1000
n = 100
B = []
for i in range(N):
    data = np.random.uniform(-1, 1, n).tolist()
    rows = [randint(0, N-1) for _ in range(n)]
    cols = [randint(0, N-1) for _ in range(n)]
    sparse_matrix = csr_matrix(coo_matrix((data, (rows, cols)), shape=(N, N)))
    B.append(sparse_matrix)
B = tuple(B)

# My function
def func(u: np.ndarray) -> np.ndarray:
    """
    u is a np.array of length N=1000
    """
    return np.array([u.dot(B[k].dot(u.transpose())) for k in range(len(B))])

# random vector to evaluate function
u = np.random.uniform(-1, 1, N)

START = time()
func(u)
END = time()
print(f"Speed : {END - START}")
>>> Speed : 0.005256175994873047
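
Note: a single time() measurement like the one above is fairly noisy. A small sketch using the standard timeit module to average over many calls (not part of the original benchmark):

from timeit import timeit

# average over 100 calls to smooth out timer noise
print(f"Speed : {timeit(lambda: func(u), number=100) / 100}")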

Solution

  • Here is another answer, using a data structure for B that is different from the one in the question and is Numba-friendly.

    Since B is "fixed", we can transform it to a more efficient array-based data-structure packing all COO matrices into only 4 big arrays. The number of non-zero values is stored so to know where each COO matrix start and end. The indices are stored in 16-bit integers (more compact in memory) since all matrices are known to be small (less than 65536x65536 and with less than 65536 non-zeros items). Unsigned integers are used to avoid wrap-around checks in Numba.

    Note that this answer is an improvement on Nick ODell's answer.

    Here is the resulting code (including the code that builds B):

    import numpy as np
    from scipy.sparse import coo_matrix
    import numba as nb
    
    # Converts the input matrix to a numba-friendly data-structure
    def to_multi_coo(B):
        all_size = np.array([b.nnz for b in B], dtype=np.uint16)
        all_rows = np.concatenate([b.row.astype(np.uint16) for b in B])
        all_cols = np.concatenate([b.col.astype(np.uint16) for b in B])
        all_data = np.concatenate([b.data for b in B])
        return (all_size, all_rows, all_cols, all_data)
    
    @nb.njit('(Tuple([uint16[::1], uint16[::1], uint16[::1], float64[::1]]), float64[::1])', fastmath=True)
    def mult_matrix_multi_coo(B, u):
        all_size, all_rows, all_cols, all_data = B
        n = len(all_size)
        offset = np.uint64(0)
        result = np.empty(n, dtype=np.float64)
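        # Flat traversal: for matrix i, c[i] = sum_k u[rows[k]] * u[cols[k]] * data[k],
        # i.e. u^T B[i] u evaluated over the non-zero entries only.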
        for i in range(n):
            s = 0.0
            for j in range(np.uint64(all_size[i])):
                s += u[all_rows[offset+j]] * u[all_cols[offset+j]] * all_data[offset+j]
            result[i] = s
            offset += all_size[i]
        return result
    
    # Create random tuple of sparse matrices
    np.random.seed(42)
    N = 1000
    n = 100
    B = []
    for i in range(N):
        data = np.random.uniform(-1, 1, n)
        rows = np.random.randint(0, N, n)
        cols = np.random.randint(0, N, n)
        sparse_matrix = coo_matrix((data, (rows, cols)), shape=(N, N))
        B.append(sparse_matrix)
    B = tuple(B)
    B = to_multi_coo(B)
    u = np.random.uniform(-1, 1, N)
    
    mult_matrix_multi_coo(B, u)
    
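    To sanity-check the packed data structure, one can keep the original tuple of COO matrices under a separate name before packing and compare both evaluations. A quick check, not from the original answer (B_coo is an illustrative name):

    B_coo = B                                 # bind the original tuple before packing
    B_packed = to_multi_coo(B_coo)
    expected = np.array([u @ (b @ u) for b in B_coo])   # plain SciPy evaluation
    assert np.allclose(mult_matrix_multi_coo(B_packed, u), expected)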

    Performance results

    Here are results on my machine with an i5-9600KF CPU:

    Initial solution:       6135 µs
    NickODell's solution:    602 µs
    This solution:            59 µs    <-----
    

    Thus, this solution is about 10 times faster than the best solution so far and about 100 times faster than the initial one. It is so fast that parallelizing it is almost certainly not worth it: creating threads and waiting for them takes time, often tens of µs on mainstream PCs (and often much more on some computing servers, such as many-core machines).
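
    That said, for much larger inputs where parallelism could pay off, the outer loop can be parallelized with Numba's prange. A minimal sketch, untested and not part of the measured results (note that prange iterations cannot share the running offset, so the offsets must be precomputed first):

    import numpy as np
    import numba as nb

    @nb.njit(parallel=True, fastmath=True)
    def mult_matrix_multi_coo_par(B, u):
        all_size, all_rows, all_cols, all_data = B
        n = len(all_size)
        # prange iterations cannot share a running offset, so precompute them all
        offsets = np.zeros(n, dtype=np.int64)
        for i in range(1, n):
            offsets[i] = offsets[i - 1] + all_size[i - 1]
        result = np.empty(n, dtype=np.float64)
        for i in nb.prange(n):
            s = 0.0
            o = offsets[i]
            for j in range(all_size[i]):
                s += u[all_rows[o + j]] * u[all_cols[o + j]] * all_data[o + j]
            result[i] = s
        return result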