a faster way of implementing the nested loop with gamma function

I am trying to evaluate the following integral:

enter image description here

I can find the area for the following polynomial as follows:

pn =

   -0.0250    0.0667    0.2500   -0.6000         0

First using the integration by Simpson's rule

fn=@(x) exp(polyval(pn,x));

area=quad(fn,-10,10);
fprintf('area evaluated by Simpsons rule : %f \n',area)

and the result is area evaluated by Simpsons rule : 11.483072 Then with the following code that evaluates the summation in the above formula with gamma function

a=pn(1);b=pn(2);c=pn(3);d=pn(4);f=pn(5);
area=0;
result=0;
for n=0:40;
    for m=0:40;
        for p=0:40;
            if(rem(n+p,2)==0)
                result=result+ (b^n * c^m * d^p) / ( factorial(n)*factorial(m)*factorial(p) ) *...
                    gamma( (3*n+2*m+p+1)/4 ) / (-a)^( (3*n+2*m+p+1)/4 );
            end
        end
    end
end

result=result*1/2*exp(f)

and this returns 11.4831. More or less the same result with the quad function. Now my question is whether or not it is possible for me to get rid of this nested loop as I will construct the cumulative distribution function so that I can get samples from this distribution using the inverse CDF transform. (for constructing the cdf I will use gammainc i.e. the incomplete gamma function instead of gamma)

I will need to sample from such densities that may have different polynomial coefficients and speed is of concern to me. I can already sample from such densities using Monte Carlo methods but I would like to see whether or not it is possible for me to use exact sampling from the density in order to speed up. Thank you very much in advance.

Solution

There are several things one might do. The simplest is to avoid calling factorial. Instead one can use the relation that

factorial(n) = gamma(n+1)

Since gamma seems to be actually faster than a call to factorial, you can save a bit there. Even better, you can

>> timeit(@() factorial(40))
ans =
      4.28681157826087e-05

>> timeit(@() gamma(41))
ans =
      2.06671024634146e-05

>> timeit(@() gammaln(41))
ans =
      2.17632543333333e-05

Even better, one can do all 4 calls in a single call to gammaln. For example, think about what this does:

gammaln([(3*n+2*m+p+1)/4,n+1,m+1,p+1])*[1 -1 -1 -1]'

Note that this call has no problem with overflows either in case your numbers get large enough. And since gammln is vectorized, that one call is fast. It costs little more time to compute 4 values than it does to compute one.

>> timeit(@() gammaln([15 20 40 30]))
ans =
      2.73937416896552e-05

>> timeit(@() gammaln(40))
ans =
      2.46521943333333e-05

Admittedly, if you use gammaln, you will need a call to exp at the end to recover the final result. You could do it with a single call to gamma however too. Perhaps like this:

g = gamma([(3*n+2*m+p+1)/4,n+1,m+1,p+1]);
g = g(1)/(g(2)*g(3)*g(4));

Next, you can be more creative in the inner loop on p. Rather than a full loop, coupled with a test to ignore the combinations you don't need, why not just do this?

for p=mod(n,2):2:40

That statement will select only those values of p that would have been used anyway, so now you can drop the if statement completely.

All of the above will give you what I'll guess is about a 5x speed increase in your loops. But it still has a set of nested loops. With some effort, you might be able to improve that too.

For example, rather than computing all of those factorials (or gamma functions) many times, do it ONCE. This should work:

a=pn(1);b=pn(2);c=pn(3);d=pn(4);f=pn(5);
area=0;
result=0;
nlim = 40;
facts = factorial(0:nlim);
gammas = gamma((0:(6*nlim+1))/4);
for n=0:nlim
  for m=0:nlim
    for p=mod(n,2):2:nlim
      result = result + (b.^n * c.^m * d.^p) ...
         .*gammas(3*n+2*m+p+1 + 1) ...
         ./ (facts(n+1).*facts(m+1).*facts(p+1)) ...
         ./ (-a)^( (3*n+2*m+p+1)/4 );
    end
  end
end

result=result*1/2*exp(f)

In my test on my machine, I find that your triply nested loops required 4.3 seconds to run. My version above produces the same result, yet required only 0.028418 seconds, a speedup of roughly 150 to 1, despite the triply nested loops.