Short version: how can I detect overflow using the fixed-point multiplication described here but for a signed type?
Long version:
I still have some overflow issues with my Q31.32 fixed point type. To make it easier to work out examples on paper, I've made a much smaller type using the same algorithm, a Q3.4 based on sbyte. I figure that if I can work out all the kinks for a Q3.4 type, the same logic should apply for a Q31.32 one.
Note that I could very easily implement Q3.4 multiplication by performing it on a 16-bit integer, but I'm doing as if that didn't exist, because for the Q31.32 I'd need a 128-bit integer which doesn't exist (and BigInteger is too slow).
I want my multiplication to handle overflow by saturation, that is when overflow happens, the result is the highest or smallest value that can be represented depending on the sign of the operands.
This is basically how the type is represented:
struct Fix8 {
sbyte m_rawValue;
public static readonly Fix8 One = new Fix8(1 << 4);
public static readonly Fix8 MinValue = new Fix8(sbyte.MinValue);
public static readonly Fix8 MaxValue = new Fix8(sbyte.MaxValue);
Fix8(sbyte value) {
m_rawValue = value;
}
public static explicit operator decimal(Fix8 value) {
return (decimal)value.m_rawValue / One.m_rawValue;
}
public static explicit operator Fix8(decimal value) {
var nearestExact = Math.Round(value * 16m) * 0.0625m;
return new Fix8((sbyte)(nearestExact * One.m_rawValue));
}
}
And this is how I currently handle multiplication:
public static Fix8 operator *(Fix8 x, Fix8 y) {
sbyte xl = x.m_rawValue;
sbyte yl = y.m_rawValue;
// split x and y into their highest and lowest 4 bits
byte xlo = (byte)(xl & 0x0F);
sbyte xhi = (sbyte)(xl >> 4);
byte ylo = (byte)(yl & 0x0F);
sbyte yhi = (sbyte)(yl >> 4);
// perform cross-multiplications
byte lolo = (byte)(xlo * ylo);
sbyte lohi = (sbyte)((sbyte)xlo * yhi);
sbyte hilo = (sbyte)(xhi * (sbyte)ylo);
sbyte hihi = (sbyte)(xhi * yhi);
// shift results as appropriate
byte loResult = (byte)(lolo >> 4);
sbyte midResult1 = lohi;
sbyte midResult2 = hilo;
sbyte hiResult = (sbyte)(hihi << 4);
// add everything
sbyte sum = (sbyte)((sbyte)loResult + midResult1 + midResult2 + hiResult);
// if the top 4 bits of hihi (unused in the result) are neither all 0s or 1s,
// then this means the result overflowed.
sbyte topCarry = (sbyte)(hihi >> 4);
bool opSignsEqual = ((xl ^ yl) & sbyte.MinValue) == 0;
if (topCarry != 0 && topCarry != -1) {
return opSignsEqual ? MaxValue : MinValue;
}
// if signs of operands are equal and sign of result is negative,
// then multiplication overflowed upwards
// the reverse is also true
if (opSignsEqual) {
if (sum < 0) {
return MaxValue;
}
}
else {
if (sum > 0) {
return MinValue;
}
}
return new Fix8(sum);
}
This gives result accurate within the precision of the type and handles most overflow cases. It doesn't however handle these ones, for example:
Failed -8 * 2 : expected -8 but got 0
Failed 3.5 * 5 : expected 7,9375 but got 1,5
Let's work out how the multiplication happens for the first one.
-8 and 2 are represented as x = 0x80 and y = 0x20.
xlo = 0x80 & 0x0F = 0x00
xhi = 0x80 >> 4 = 0xf8
ylo = 0x20 & 0x0F = 0x00
yhi = 0x20 >> 4 = 0x02
lolo = xlo * ylo = 0x00
lohi = xlo * yhi = 0x00
hilo = xhi * ylo = 0x00
hihi = xhi * yhi = 0xf0
The sum is obviously 0 as all terms are 0 save for hihi, but only the lowest 4 bits of hihi are used in the final sum.
My usual overflow detection magic doesn't work here: the result is zero so the sign of the result is meaningless (e.g. 0.0625 * -0.0625 == 0 (by rounding down), 0 is positive yet signs of operands differ); also the high bits of hihi are 1111 which often happens even when there's no overflow.
Basically I don't know how to detect that overflow happened here. Is there a more general method?
This took me a long time, but I eventually figured everything out. This code is tested to work for every possible combination of x and y in the range allowed by sbyte. Here is the commented code:
static sbyte AddOverflowHelper(sbyte x, sbyte y, ref bool overflow) {
var sum = (sbyte)(x + y);
// x + y overflows if sign(x) ^ sign(y) != sign(sum)
overflow |= ((x ^ y ^ sum) & sbyte.MinValue) != 0;
return sum;
}
/// <summary>
/// Multiplies two Fix8 numbers.
/// Deals with overflow by saturation.
/// </summary>
public static Fix8 operator *(Fix8 x, Fix8 y) {
// Using the cross-multiplication algorithm, for learning purposes.
// It would be both trivial and much faster to use an Int16, but this technique
// won't work for a Fix64, since there's no Int128 or equivalent (and BigInteger is too slow).
sbyte xl = x.m_rawValue;
sbyte yl = y.m_rawValue;
byte xlo = (byte)(xl & 0x0F);
sbyte xhi = (sbyte)(xl >> 4);
byte ylo = (byte)(yl & 0x0F);
sbyte yhi = (sbyte)(yl >> 4);
byte lolo = (byte)(xlo * ylo);
sbyte lohi = (sbyte)((sbyte)xlo * yhi);
sbyte hilo = (sbyte)(xhi * (sbyte)ylo);
sbyte hihi = (sbyte)(xhi * yhi);
byte loResult = (byte)(lolo >> 4);
sbyte midResult1 = lohi;
sbyte midResult2 = hilo;
sbyte hiResult = (sbyte)(hihi << 4);
bool overflow = false;
// Check for overflow at each step of the sum, if it happens overflow will be true
sbyte sum = AddOverflowHelper((sbyte)loResult, midResult1, ref overflow);
sum = AddOverflowHelper(sum, midResult2, ref overflow);
sum = AddOverflowHelper(sum, hiResult, ref overflow);
bool opSignsEqual = ((xl ^ yl) & sbyte.MinValue) == 0;
// if signs of operands are equal and sign of result is negative,
// then multiplication overflowed positively
// the reverse is also true
if (opSignsEqual) {
if (sum < 0 || (overflow && xl > 0)) {
return MaxValue;
}
}
else {
if (sum > 0) {
return MinValue;
}
// If signs differ, both operands' magnitudes are greater than 1,
// and the result is greater than the negative operand, then there was negative overflow.
sbyte posOp, negOp;
if (xl > yl) {
posOp = xl;
negOp = yl;
}
else {
posOp = yl;
negOp = xl;
}
if (sum > negOp && negOp < -(1 << 4) && posOp > (1 << 4)) {
return MinValue;
}
}
// if the top 4 bits of hihi (unused in the result) are neither all 0s nor 1s,
// then this means the result overflowed.
sbyte topCarry = (sbyte)(hihi >> 4);
// -17 (-1.0625) is a problematic value which never causes overflow but messes up the carry bits
if (topCarry != 0 && topCarry != -1 && xl != -17 && yl != -17) {
return opSignsEqual ? MaxValue : MinValue;
}
// Round up if necessary, but don't overflow
var lowCarry = (byte)(lolo << 4);
if (lowCarry >= 0x80 && sum < sbyte.MaxValue) {
++sum;
}
return new Fix8(sum);
}
I'm putting all this together into a properly unit tested fixed-point math library for .NET, which will be available here: https://github.com/asik/FixedMath.Net