
How can I improve fixed-point data type utilization?


I'm trying to quantize a convolutional neural network to reduce its memory footprint, going from the FP32 data type to int16. The problem is that I'm getting poor results, and since this is the first time I've used this kind of representation, I have some doubts about whether my implementation is correct.

First of all, I'm quantizing both the input data and the weights with the following function (uniform quantization):

#include <stdint.h>

#define FXP 16 //total word length in bits

int16_t quantize(float a, int fxp){
    int32_t maxVal = (1 << (FXP-1)) - 1; //32767 for a 16-bit word
    float scaled = a * (float)(1 << fxp); //mapping to fxp fractional bits

    //rounding (half away from zero), done before the float-to-int truncation
    int32_t value;
    if (a >= 0){
        value = (int32_t)(scaled + 0.5f);
    }else{
        value = (int32_t)(scaled - 0.5f);
    }

    //clipping (symmetric saturation)
    if(value > maxVal){
        return (int16_t)maxVal;
    }else if(value < -maxVal){
        return (int16_t)(-maxVal);
    }else{
        return (int16_t)value;
    }
}


int16_t value = quantize(test_data[i],10);
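
For reference, a quantized value can be mapped back to float by dividing by the same scale factor. A minimal round-trip check (the dequantize helper is just for illustration, it is not part of the network code):

float dequantize(int16_t q, int fxp){
    return (float)q / (float)(1 << fxp); //inverse of the mapping above
}

float recovered = dequantize(quantize(test_data[i], 10), 10);
float error = test_data[i] - recovered; //bounded by about 2^-11 for Q5.10, unless clipping occurred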

In this case I'm using a Q5.10 format (from the data I have, it seems to be the best format to use). Once all numbers have been converted, the arithmetic within the network (multiplications and sums/subtractions, used for example in the convolutions) is implemented in this way:

//FXP_VALUE is the number of fractional bits (10 for Q5.10); the outer loop over i is omitted here
for(int k=0; k<output_fea; k++){
    int32_t accumulator = 0;

    for(int l=minimum; l<maximum; l++){
        for(int j=0; j<input_fea; j++){
            //both data and weights are int16_t in Q5.10; the product has 20 fractional bits,
            //so it is rounded and shifted back to 10 fractional bits before accumulating
            accumulator += (data[l][j]*weights[k][l][j] + (1 << (FXP_VALUE-1))) >> FXP_VALUE;
        }
    }

    //saturate before going from int32_t to int16_t
    if(accumulator > INT16_MAX){
        accumulator = INT16_MAX;
    }else if(accumulator < INT16_MIN){
        accumulator = INT16_MIN;
    }

    result[i][k] = (int16_t)ReLU(accumulator); //result is int16_t
}
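
As a sanity check on a single product with fxp = 10 (just a worked example): 1.5 maps to 1536 and 2.0 maps to 2048; 1536*2048 = 3145728, adding the rounding constant 512 gives 3146240, and shifting right by 10 gives 3072, i.e. 3.0 in Q5.10, as expected.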

Is what I am doing correct? Are there any steps I could take to improve the results and reduce the approximation error?


Solution

  • You should check how much error rounding and clipping introduce into your values. Keep working with floating-point values, but introduce just rounding; then introduce just clipping; then introduce both. How much error ends up in your results? (One way to simulate this is shown in the sketch at the end of this answer.)

    Also, regarding the fixed-point format: even if Q5.10 seems like the best format to use, it may not be. Try different formats and check the error in the results for each one. Also try using different formats at different stages of the calculation (i.e. at different layers). Each application has its own problems, so you have to build intuition for how much rounding and clipping (separately) affect your results.

    If your results are very sensitive to rounding errors, you might want to use int16 for some stages and float32 for others.
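
    A minimal sketch of that experiment in C, staying in float and only emulating the two effects (the round_only, clip_only and max_abs_error helpers are just illustrative names, not an existing API; adapt the error metric to your own pipeline):

    #include <math.h>

    //emulate rounding to a grid of 2^-fxp while staying in float
    float round_only(float a, int fxp){
        float scale = (float)(1 << fxp);
        return roundf(a * scale) / scale;
    }

    //emulate clipping to the range representable by a 16-bit word with fxp fractional bits
    float clip_only(float a, int fxp){
        float maxVal = (float)((1 << 15) - 1) / (float)(1 << fxp);
        if (a > maxVal) return maxVal;
        if (a < -maxVal) return -maxVal;
        return a;
    }

    //compare a modified run against the pure float reference
    float max_abs_error(const float *ref, const float *test, int n){
        float err = 0.0f;
        for (int i = 0; i < n; i++){
            float d = fabsf(ref[i] - test[i]);
            if (d > err) err = d;
        }
        return err;
    }

    Run the float network once as a reference, then three more times where every intermediate value goes through round_only, clip_only, or clip_only(round_only(...)), and compare each run against the reference with max_abs_error (or better, with your task metric such as accuracy). That tells you which effect dominates; repeating the experiment for several fxp values also tells you whether 10 fractional bits is really the best split, per layer if needed.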