gpu, vhdl, texture-mapping, hdl, dxt

Efficiently implementing DXT1 texture decompression in hardware


DXT1 compression is designed to be fast to decompress in the hardware where it's used, i.e. in texture samplers. The Wikipedia article says that under certain circumstances you can work out the interpolated colours as:

c2 = (2/3)*c0+(1/3)*c1

or rearranging that:

c2 = (1/3)*(2*c0+c1)

However you rearrange the above equation, you always end up having to multiply something by 1/3 (or divide by 3, which is the same thing and even more expensive). And it seems weird to me that a texture format designed to be fast to decompress in hardware would require a multiplication or division. The FPGA I'm implementing my GPU on has only limited resources for multiplications, and I want to save those for where they're really required.

So am I missing something? Is there an efficient way of avoiding the multiplication of the colour channels by 1/3? Or should I just eat the cost of that multiplication?
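
For concreteness, here is a small C reference model of the interpolation in question, written as a plain software sketch rather than hardware (the bit-replication endpoint expansion and the function names are my own assumptions, not something stated above):

```c
#include <stdint.h>
#include <stdio.h>

/* Expand a 5- or 6-bit DXT1 channel to 8 bits by bit replication
 * (a common convention; the exact expansion is an assumption here). */
static uint8_t expand5(uint8_t v) { return (uint8_t)((v << 3) | (v >> 2)); }
static uint8_t expand6(uint8_t v) { return (uint8_t)((v << 2) | (v >> 4)); }

/* One channel of the interpolated colours from the formulas above:
 *   c2 = (2*c0 + c1) / 3,   c3 = (c0 + 2*c1) / 3
 * This is the "obvious" form that appears to need a divide by 3. */
static uint8_t lerp_third(uint8_t a, uint8_t b)
{
    return (uint8_t)((2u * a + b) / 3u);
}

int main(void)
{
    /* Example endpoints: c0 = pure red, c1 = pure blue, both in RGB565. */
    uint16_t c0 = 0xF800, c1 = 0x001F;

    uint8_t r0 = expand5((c0 >> 11) & 0x1F), r1 = expand5((c1 >> 11) & 0x1F);
    uint8_t g0 = expand6((c0 >>  5) & 0x3F), g1 = expand6((c1 >>  5) & 0x3F);
    uint8_t b0 = expand5( c0        & 0x1F), b1 = expand5( c1        & 0x1F);

    printf("c2 = (%u, %u, %u)\n",
           lerp_third(r0, r1), lerp_third(g0, g1), lerp_third(b0, b1));
    printf("c3 = (%u, %u, %u)\n",
           lerp_third(r1, r0), lerp_third(g1, g0), lerp_third(b1, b0));
    return 0;
}
```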


Solution

  • This might be a bad way of imagining it, but could you implement it using addition/subtraction of successive halves (shifts)?

    As you have 16 bits, this lets you get quite an accurate result with successive additions and subtractions.

    A third can be built up from successive halves with alternating signs:

    A/3 = A/2 - A/4 + A/8 - A/16 + ...

    i.e. a(n+1) = a(n) +/- (A >> (n+1)), where the alternating pattern [-, +, -, +, ...] says whether to subtract or add each successively shifted copy of A. (Equivalently, 1/3 is 0.0101... in binary, so you can simply add together the shifted copies picked out by that bit pattern.)

    I believe this is called fixed-point (fractional) maths.

    However, in FPGAs, it is difficult to know whether this is actually more power efficient than using the native DSP blocks (e.g. DSP48E1) the device provides. A bit-accurate software sketch of the shift-and-add scheme is shown below.
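
A minimal C sketch of the shift-and-add idea above (the function name, the 11-bit truncation point, and the operand range check are my own choices, not from the answer): 1/3 is 0.0101... in binary, and truncating that expansion at 11 fractional bits, with the constant rounded up, gives 683 = 0b1010101011. For the operands that actually occur in DXT1 (x = 2*c0 + c1 with 8-bit channels, so x <= 765) this sum of shifted copies reproduces the exact integer divide by 3.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Divide by 3 using only shifts and adds.
 *
 * 1/3 = 0.0101010101... in binary, so x/3 can be built by summing shifted
 * copies of x according to that bit pattern.  Truncating at 11 fractional
 * bits and rounding the constant up gives 683 = 0b1010101011, i.e.
 *
 *   x/3 ~= ((x << 9) + (x << 7) + (x << 5) + (x << 3) + (x << 1) + x) >> 11
 *
 * For x <= 765 (the DXT1 case) the result equals the exact integer
 * quotient x / 3.
 */
static uint32_t div3_shift_add(uint32_t x)
{
    uint32_t sum = (x << 9) + (x << 7) + (x << 5) + (x << 3) + (x << 1) + x;
    return sum >> 11;
}

int main(void)
{
    /* Verify the claim over the full DXT1 operand range 0..765. */
    for (uint32_t x = 0; x <= 765; x++)
        assert(div3_shift_add(x) == x / 3);

    /* c2 red channel for c0.r = 255, c1.r = 0: (2*255 + 0) / 3 = 170. */
    printf("%u\n", div3_shift_add(2u * 255u + 0u));
    return 0;
}
```

In hardware, each shifted term is just a rewiring of the operand, so the whole computation reduces to a handful of adders and needs no multiplier or divider.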