Faster way to transpose each row of a matrix and multiply the resulting vector by some other matrix?

I have an input matrix X with dimensions N_rows x N_cols. I also have a sparse, tridiagonal matrix M which is square of size N_rows x N_rows. These are created as follows:

N_rows = 3;
N_cols = 6;
X = rand(N_rows,N_cols);

mm = 10*ones(N_cols,1); % Subdiagonal elements
dd = 20*ones(N_cols,1); % Main diagonal elements
pp = 30*ones(N_cols,1); % Superdiagonal elements
M = spdiags([mm dd pp],-1:1,N_cols,N_cols);

and look like the following:

>> X

X =

    0.4018    0.1233    0.4173    0.9448    0.3377    0.1112
    0.0760    0.1839    0.0497    0.4909    0.9001    0.7803
    0.2399    0.2400    0.9027    0.4893    0.3692    0.3897

full(M)

ans =

     2     3     0     0     0     0
     1     2     3     0     0     0
     0     1     2     3     0     0
     0     0     1     2     3     0
     0     0     0     1     2     3
     0     0     0     0     1     2

I would like to take each row of X, and do a matrix multiplication with M, and piece the obtained rows back together to obtain an output Y. At the moment, I achieve this successfully with the following:

Y = (M*X.').';

The example above is for a 3x6 matrix for X, but in reality I need to do this for a matrix with dimensions 500 x 500, about 10000 times, and the profiler says that this operation in the bottleneck in my larger code. Is there a faster way to do this row-by-row matrix multiplication multiplication?

On my system, the following takes around 20 seconds to do this 10000 times:

N_rows = 500;
N_cols = 500;
X = rand(N_rows,N_cols);

mm = 10*ones(N_cols,1); % Subdiagonal elements
dd = 20*ones(N_cols,1); % Main diagonal elements
pp = 30*ones(N_cols,1); % Superdiagonal elements
M = spdiags([mm dd pp],-1:1,N_cols,N_cols);

tic
for k = 1:10000
Y = (M*X.').';
end
toc

Elapsed time is 18.632922 seconds.

Solution

You can use X*M.' instead of (M*X.').';. This saves around 35% of time on my computer.

This can be explained because transposing (or permuting dimensions) implies rearranging the elements in the internal (linear-order) representation of the matrix, which takes time.