Is there an accurate approximation of the acos() function?

I need an acos() function with double precision within a compute shader. Since there is no built-in function of acos() in GLSL with double precision, I tried to implement my own.

At first, I implemented a Taylor series like the equation from Wiki - Taylor series with precalculated faculty values. But that seems to be inaccurate around 1. The maximum error was something about 0.08 with 40 iterations.

I also implemented this method which works very well on CPU with a maximum error of -2.22045e-16, but I have some troubles to implement this within the shader.

Currently, I am using an acos() approximation function from here where somebody posted his approximation functions on this site. I am using the most accurate function of this site and now I get a maximum error of -7.60454e-08, but also that error is too high.

My code of this function is:

double myACOS(double x)
{
    double part[4];
    part[0] = 32768.0/2835.0*sqrt(2.0-sqrt(2.0+sqrt(2.0+sqrt(2.0+2.0*x))));
    part[1] = 256.0/135.0*sqrt(2.0-sqrt(2.0+sqrt(2.0+2.0*x)));
    part[2] = 8.0/135.0*sqrt(2.0-sqrt(2.0+2.0*x));
    part[3] = 1.0/2835.0*sqrt(2.0-2.0*x);
    return (part[0]-part[1]+part[2]-part[3]);
}

Does anybody know another implementation method of acos() which is very accurate and -if possible- easy to implement in a shader?

Some system information:

Nvidia GT 555M
running OpenGL 4.3 with optirun

Solution

My current accurate shader implementation of 'acos()' is a mix out of the usual Taylor series and the answer from Bence. With 40 iterations I get an accuracy of 4.44089e-16 to the 'acos()' implementation from math.h. Maybe it is not the best, but it works for me:

And here it is:

double myASIN2(double x)
{
    double sum, tempExp;
    tempExp = x;
    double factor = 1.0;
    double divisor = 1.0;
    sum = x;
    for(int i = 0; i < 40; i++)
    {
        tempExp *= x*x;
        divisor += 2.0;
        factor *= (2.0*double(i) + 1.0)/((double(i)+1.0)*2.0);
        sum += factor*tempExp/divisor;
    }
    return sum;
}

double myASIN(double x)
{
    if(abs(x) <= 0.71)
        return myASIN2(x);
    else if( x > 0)
        return (PI/2.0-myASIN2(sqrt(1.0-(x*x))));
    else //x < 0 or x is NaN
        return (myASIN2(sqrt(1.0-(x*x)))-PI/2.0);

}

double myACOS(double x)
{
    return (PI/2.0 - myASIN(x));
}

Any comments, what could be done better? For example using a LUT for the values of factor, but in my shader 'acos()' is just called once so there is no need for it.