Search code examples
cfixed-pointtaylor-series

Single precision in Fixed Point Arithmetic


I need up to 6 decimal places precision for a Taylor series calculation using fixed point arithmetic. I have tried different fixed point format for achieving 6 decimal places precision.

For example, Using s16.15 (Left shift by 15) format I have got up to 2 decimal places precision.1 sign bit,16 integer bits and 15 fraction bits.

For s8.23 (Left shift by 23) format up to 4 decimal places and with s4.27 (Left shift by 27) format the precision is still the same. I was expecting the situation will improve.

The following is a Taylor Series expansion to calculate natural logarithm around a certain point a.

So q=x-a, x is the user input between 1 and 2.

  // These are converted constants into s4.27 fixed point format
  const int32_t con=0x0B8AA3B3; //1.44269504088895
  const int32_t c0=0x033E647E; //0.40546510810816
  const int32_t c1=0x05555555; //0.66666666666666
  const int32_t c2=0x01C71C72; //0.222222222222
  const int32_t c3=0x00CA4588; //0.0987654321
  const int32_t c4=0x006522C4; //0.04938271605
  const int32_t c5=0x0035F069; //0.02633744856
  const int32_t c6=0x001DF757; //0.01463191587

//Expanded taylor series    
taylor=c0+mul(q,(c1-mul(q,(c2+mul(q,(c3-mul(q,(c4-mul(q,(c5+mul(q,c6)))))))))));
// Multiplication function
int32_t mul(int32_t x, int32_t y)
{
int32_t mul;
mul=((((x)>>13)*((y)>>13))>>1); // for s4.27 format, the best possible right shift
return mul;
}

Above mentioned code snippets were used in C.

Result I need: 0.584963 but the result I got is: 0.584949

How can I achieve more precision?


Solution

  • OP's mul() throws away too much precision.

    (x)>>13)*((y)>>13) immediately discards the least significant 13 bits of x and y.

    Instead, perform a 64-bit multiply

    int32_t mul_better(int32_t x, int32_t y) {
      int64_t mul = x;
      mul *= y;
      mul >>= 27;
    
      // Code may want to detect overflow here (not shown)
    
      return (int32_t) mul;
    }
    

    Even better, round the product to nearest (ties to even) before discarding the least significant bits. Simplifications are possible. Verbose code below as it is illustrative.

    int32_t mul_better(int32_t x, int32_t y) {
      int64_t mul = x;
      mul *= y;
      int32_t least = mul % ((int32_t)1 << 27);
      mul /= (int32_t)1 << 27;
      int carry = 0;
      if (least >= 0) {
        if (least >  ((int32_t)1 << 26) carry = 1;
        else if ((least ==  ((int32_t)1 << 26)) && (mul % 2)) carry = 1;
      } else {
        if (-least > ((int32_t)1 << 26) carry = -1;
        else if ((-least ==  ((int32_t)1 << 26)) && (mul % 2)) carry = -1;
      }
      return (int32_t) (mul + carry);
    }
    

    int32_t mul(int32_t x, int32_t y) {
      int64_t mul = x;
      mul *= y;
      return mul >> 27;
    }
    
    void foo(double x) {
      int32_t q = (int32_t) (x * (1 << 27));  // **
      int32_t taylor =
          c0 + mul(q, (c1 - mul(q, (c2  + mul(q,
          (c3 - mul(q, (c4 - mul(q, (c5 + mul(q, c6)))))))))));
      printf("%f %f\n", x,  taylor * 1.0 / (1 << 27));
    }
    
    int main(void) {
      foo(0.303609);
    }
    

    Output

    0.303609 0.584963
    

    ** Could round here too rather than simply truncate the FP to an integer.