Tags: opencl, nvidia, amd-gpu

OpenCL Sine Implementation


I'm looking for the OpenCL sine implementation.
I know the OpenCL implementation is hardware-vendor-specific, so the Nvidia implementation could look different from the AMD one. What I want to know is whether I need to implement my own sine for speed reasons.
Beyond that, what is the difference between sin and native_sin?


Solution

  • Here is a test on an AMD implementation, applying sin repeatedly to its own result so that any error becomes more chaotic as the iterations increase (100 in this example):

    __kernel void sin_test_0(__global float *a)
    {
        int id = get_global_id(0);
        float r = a[id];
        for(int i = 0; i < 100; i++)   // feed sin back into its own result 100 times
            r = sin(r);
        a[id] = r;
    }
    

    Each a[id] was first initialized to 1111, for all 16M elements.

    • sin() = -0.1692203; completed in 265 ms (320-core GPU) and 1950 ms (8-core CPU using float4)
    • C#'s implementation with the Math library = -0.1692202; completed in 55505 ms (single core), 12998 ms (4 threads) and 8200 ms (max threads with Parallel.For), without any explicit compiler hints about vectorization
    • native_sin() = -0.1692208; completed in 45 ms (a kernel variant for this row is sketched below)
    • half_sin() = -0.1692207; completed in 165 ms
    • series expansion of sine (for input in [-1, 1]) = -0.155202; completed in 40 ms

    Only the 7th digit differs, and that may be because C# computes with the double type, while the native version drifts a bit farther from the reference. half_sin seems to be even more accurate than native_sin, but slower. half_sin accepts inputs in the range -2^16 to +2^16.
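
    The native_sin and half_sin rows above were presumably produced with the same iterative kernel, just with the call swapped; a minimal sketch of such a variant (the kernel name is mine) could look like:

    __kernel void native_sin_test_0(__global float *a)
    {
        int id = get_global_id(0);
        float r = a[id];
        for(int i = 0; i < 100; i++)
            r = native_sin(r);   // or half_sin(r) for the half_sin row
        a[id] = r;
    }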

    Series expansion:

    float sin_se(float x)
    {
        // reduce x by whole multiples of 2*pi (0.15915494309 = 1/(2*pi))
        x -= 6.28318530718f*(convert_int(x*0.15915494309f));
        float xs = x*x;      // x^2
        float xc = x*x*x;    // x^3
        // truncated Taylor series: x - x^3/6 + x^5/120 - x^7/5040
        return ((x - xc*0.166666f) + (xc*xs)*0.0083333f) - (xc*xs*xs)*0.0001984f;
    }
    

    If the input is between -1 and +1, the first line is not necessary, and the function becomes faster.
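
    For that restricted range, a hypothetical trimmed-down variant (the name sin_se_small is mine) is simply the same polynomial without the reduction step:

    float sin_se_small(float x)   // valid only for x roughly in [-1, 1]
    {
        float xs = x*x;      // x^2
        float xc = x*xs;     // x^3
        // same truncated Taylor series: x - x^3/6 + x^5/120 - x^7/5040
        return ((x - xc*0.166666f) + (xc*xs)*0.0083333f) - (xc*xs*xs)*0.0001984f;
    }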

    native_sin() is probably using hardware-based features to speed things up, such as a lookup table of magic numbers and a Newton-Raphson engine. You may not surpass the performance of those parts with software emulation at an equal error. The example above is from a GPU; the difference on a CPU is minor. Even if OpenCL dictates that all devices must stay under 100 ULP of error, one device may be at 90 ULP while another is at 70 ULP, and accumulated error widens the gap between them. If you don't accumulate much error and you have some safety digits, you can just use native_sin; otherwise, you can add your own series-expansion-like algorithm so that all devices compute the same way, at the cost of more error.
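
    To judge how much the two variants drift apart on a particular device before committing to native_sin, a minimal sketch (the kernel name and interface are my own) could iterate both side by side and store the accumulated difference:

    __kernel void sin_drift(__global const float *a, __global float *drift)
    {
        int id = get_global_id(0);
        float r_ref = a[id];   // iterated with the precise sin()
        float r_nat = a[id];   // iterated with native_sin()
        for(int i = 0; i < 100; i++)
        {
            r_ref = sin(r_ref);
            r_nat = native_sin(r_nat);
        }
        // divergence after 100 chained evaluations
        drift[id] = fabs(r_ref - r_nat);
    }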