I need to calculate Stirling's approximation very fast

I'm writing a small library for statistical sampling which needs to run as fast as possible. In profiling I discovered that around 40% of the time taken in the function is spent computing Stirling's approximation for the logarithm of the factorial. I'm focusing my optimization efforts on that piece of it. Here's my code (which uses MPFR):

const double AL[8] =
{ 0.0, 0.0, 0.6931471806, 1.791759469, 3.178053830, 4.787491743,
    6.579251212, 8.525161361 };
void HGD::mpfr_afc(mpfr_t &ret, const mpfr_t &val){

    if(mpfr_cmp_ui(val, 7) <= 0){
        mpfr_set_d(ret, AL[mpfr_get_ui(val, MPFR_RNDN)], MPFR_RNDN);
    }else{
        mpfr_set(ret, val, MPFR_RNDN);
        mpfr_add_d(ret, ret, 0.5, MPFR_RNDN);
        mpfr_log(LV, val, MPFR_RNDN);
        mpfr_mul(ret, LV, ret, MPFR_RNDN);
        mpfr_sub(ret, ret, val, MPFR_RNDN);
        mpfr_add_d(ret, ret, 0.399089934, MPFR_RNDN);
    }
}

I have a couple different ideas:

Precompute more than the first 8 inputs to the function.
Optimize the math (use a coarser approximation for smaller precision)
Use multiple threads to compute on different inputs in parallel
Switch to native arithmetic when numbers can fit in machine data types

Are there other approaches I could take?

Solution

Switch to native arithmetic when numbers can fit in machine data types

That would be my first attempt. MPFR is likely to be a performance killer.

It seems to me you want to compute the logarithm of n! which you are already approximating with Stirling's formula.

Note that n!=Gamma(n+1). There are (seemingly) highly optimized functions to compute both the Gamma function and its logarithm. For example:

I would roll my own coarser approximation only if all the above fails performance-wise.