My work requires constructing large square matrices of order 1000 as NumPy arrays, whose elements are defined analytically as a function of their indices. Right now I create a zero array and loop over the elements to fill in the required values, which by itself takes a hefty amount of time. Is there any way to make the construction more efficient or faster, say by using a GPU or parallel computing or the like?
Example: I want to construct a (1000, 1000) matrix J (which is the Jacobian of the system I'm studying). I have the definition for Jij given by a function J_ij(i, j, a, b), where a and b are parameters which are fixed for a given matrix. I begin by creating the array J of zeros, then iterate over the entries and update them according to the function J_ij. The full codebase is given here, but the following is the only relevant part:
def J_ij(self, i: int, j: int, xi_mat: np.ndarray, eta: np.ndarray):
    """
    Calculates the ij-th element of the jacobian* for input pattern eta

    Parameters
    ----------
    i, j : int
        Indices of the matrix
    xi_mat : np.ndarray
        Memory pattern matrix of shape (p, n)
    eta : np.ndarray
        Input pattern of shape (n, )

    Returns
    -------
    float value of ij-th element of J
    """
    n = self.n
    p = self.p
    jij = 0
    for mu in range(p):
        jij += xi_mat[mu, i]*xi_mat[mu, j]*eta[i]*eta[j]
    if i == j:
        for k in range(n):
            for mu in range(p):
                jij -= xi_mat[mu, i]*xi_mat[mu, k]*eta[i]*eta[k]
    return jij/n
def jacobian(self, eta: np.ndarray):
    """
    Generates the jacobian* for stability analysis

    Parameters
    ----------
    eta : np.ndarray
        Input pattern

    Returns
    -------
    jacobian matrix as a (n, n) numpy array
    """
    n = self.n
    J = np.zeros((n, n))
    if self.xi_mat is None:
        self.init_patterns()
    xi_mat = self.xi_mat
    for idx in range(n**2):
        i = idx // n
        j = idx % n
        J[i, j] += self.J_ij(i, j, xi_mat, eta)
    return J
xi_mat is a matrix whose rows are binary patterns made of 1s and -1s. eta is one such binary pattern.
I tried using cupy instead of numpy, but due to (independent) issues with my dedicated GPU not being recognized by CUDA on my Arch Linux installation, it actually took longer than numpy.
Your jacobian function looks a lot like a straightforward matrix multiply, and can be vectorized simply as follows (assuming that xi_mat is a p-by-n matrix and eta is an n-element vector):
M = xi_mat * eta                   # scale column j of xi_mat by eta[j]
J = M.T @ M                        # J[i, j] = sum_mu xi[mu, i]*eta[i] * xi[mu, j]*eta[j]
J -= np.diag(np.sum(J, axis=0))    # subtract the column sums on the diagonal (the k-loop)
J /= n
As this uses a NumPy matrix multiply, it should run substantially faster than the manual loop. Timing results on my iPhone for n=1000, p=10, and random xi_mat and eta:
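To convince yourself the two versions agree before relying on the speedup, a minimal check on a small random instance (using a standalone copy of your J_ij without the class wrapper, and small n and p so the loop version finishes quickly) might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5  # small sizes so the loop version runs in a moment
xi_mat = rng.choice([-1.0, 1.0], size=(p, n))
eta = rng.choice([-1.0, 1.0], size=n)

def J_ij(i, j, xi_mat, eta, n, p):
    # Loop version, as in the question
    jij = 0.0
    for mu in range(p):
        jij += xi_mat[mu, i]*xi_mat[mu, j]*eta[i]*eta[j]
    if i == j:
        for k in range(n):
            for mu in range(p):
                jij -= xi_mat[mu, i]*xi_mat[mu, k]*eta[i]*eta[k]
    return jij/n

J_loop = np.array([[J_ij(i, j, xi_mat, eta, n, p) for j in range(n)]
                   for i in range(n)])

# Vectorized version from above
M = xi_mat * eta
J_vec = M.T @ M
J_vec -= np.diag(np.sum(J_vec, axis=0))
J_vec /= n

print(np.allclose(J_loop, J_vec))  # True
```

Note that np.sum(J_vec, axis=0) sums over the first index, matching the k-loop in J_ij because J_vec is symmetric at that point.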