How to speed up this GSL code for selecting a submatrix?

I wrote a very simple function in GSL, to select a submatrix from an existing matrix in a struct.

EDIT: I had timed VERY INCORRECTLY and didn't notice the changed number of zeros in front.Still, I hope this can be sped up

For 100x100 submatrices of a 10000x10000 matrix, it takes 1.2E-5 seconds. So, repeating that 1E4 times, takes 50 times longer than I need to diagonalise the 100x100 matrix. EDIT: I realise, it happens even if I comment out everything except return(0); Thus, I theorize, it must be something about struct TOWER. This is how TOWER looks:

struct TOWER 
{
    int array_level[TOWERSIZE];
    int array_window[TOWERSIZE];
    gsl_matrix *matrix_ordered_covariance;
    gsl_matrix *matrix_peano_covariance;

    double array_angle_tw[XISTEP];
    double array_correl_tw[XISTEP]; 
    gsl_interp_accel *acc_correl;   // interpolating for correlation
    gsl_spline *spline_correl;

    double array_all_eigenvalues[TOWERSIZE]; //contains all eiv. of whole matrix

    std::vector< std::vector<double> > cropped_peano_covariance, peano_mask;

};

Below comes my function!

/* --- --- */
int monolevelsubmatrix(int i, int j, struct TOWER *tower, gsl_matrix *result)  //relying on spline!! //must addd auto vanishing
{
    int firstrow, firstcol,mu,nu,a,b;
    double aux, correl;

    firstrow = helix*i;
    firstcol = helix*j;

    gsl_matrix_view Xi = gsl_matrix_submatrix (tower ->matrix_ordered_covariance, firstrow, firstcol, helix, helix);
    gsl_matrix_memcpy (result, &(Xi.matrix));

    return(0);  
}
/* --- --- */

Solution

The problem is almost certainly gls_matric_memcpy. The source for that is in copy_source.c, with:

    const size_t src_tda = src->tda ;
    const size_t dest_tda = dest->tda ;
    size_t i, j;

    for (i = 0; i < src_size1 ; i++)
      {
        for (j = 0; j < MULTIPLICITY * src_size2; j++)
          {
            dest->data[MULTIPLICITY * dest_tda * i + j] 
              = src->data[MULTIPLICITY * src_tda * i + j];
          }
      }

This would be quite slow. Note that gls_matrix_memcpy returns a GLS_ERROR if the matrices are different sizes, so it's very likely the data member could be served with a CRT memcpy on the data members of dest and src.

This loop is very slow. Each cell is derefence through dest & src structs for the data member, and THEN indexed.

You could choose to write a replacement for the library, or write your own personal version of this matrix copy, with something like (untested suggestion code here):

unsigned int cellsize = sizeof( src->data[0] ); // just psuedocode here

memcpy( dest->data, src->data, cellsize * src_size1 * src_size2 * MULTIPLICITY )

Note that MULTIPLICITY is a define, usually 1 or 2, probably depends on library configuration - might not apply to your usage (if it's 1 )

Now, important caveat....if the source matrix is a subview, then you have to go by rows...that is, a loop of rows in i where crt's memcpy is limited to rows at a time, not the entire matrix as I show above.

In other words, you do have to account for the source matrix geometry from which the subview was taken...that's probably why they index each cell (makes it simple).

If, however, you KNOW the geometry, you can very likely optimize this WAY above the performance you're seeing.

If all you did was take out the src/dest derefence, you'd see SOME performance gain, as in:

        const size_t src_tda = src->tda ;
        const size_t dest_tda = dest->tda ;
        size_t i, j;

        float * dest_data = dest->data; // psuedocode here
        float * src_data  = src->data; // psuedocode here

        for (i = 0; i < src_size1 ; i++)
          {
            for (j = 0; j < MULTIPLICITY * src_size2; j++)
              {
                dest_data[MULTIPLICITY * dest_tda * i + j] 
                  = src_data[MULTIPLICITY * src_tda * i + j];
              }
          }

We'd HOPE the compiler recognized that anyway, but...sometimes...