I'm working on a Navier-Stokes fluid dynamics solver that should run in real time, so performance is important.
Right now, I'm looking at a number of tight loops that each account for a significant fraction of the execution time: there is no single bottleneck. Most of these loops do some floating-point arithmetic, but there's a lot of branching in between.
The floating-point operations are mostly limited to additions, subtractions, multiplications, divisions and comparisons. All this is done using 32-bit floats. My target platform is x86 with at least SSE1 instructions. (I've verified in the assembler output that the compiler indeed generates SSE instructions.)
Most of the floating-point values that I'm working with have a reasonably small upper bound, and precision for near-zero values isn't very important. So the thought occurred to me: maybe switching to fixed-point arithmetic could speed things up? I know the only way to be really sure is to measure it, but that might take days, so I'd like to know the odds of success beforehand.
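To be concrete, here's the kind of fixed-point format I have in mind — a minimal Q16.16 sketch (the typedef and helper names are just mine for illustration):

```cpp
#include <cstdint>

// Minimal Q16.16 fixed-point sketch (illustrative only):
// 16 integer bits, 16 fractional bits in a 32-bit signed int.
typedef int32_t fixed;

const int FRAC_BITS = 16;

inline fixed to_fixed(float f) { return (fixed)(f * (1 << FRAC_BITS)); }
inline float to_float(fixed x) { return (float)x / (1 << FRAC_BITS); }

// Add/subtract are plain integer ops; multiply needs a 64-bit
// intermediate, then a shift back down to Q16.16.
inline fixed fmul(fixed a, fixed b) {
    return (fixed)(((int64_t)a * b) >> FRAC_BITS);
}

// Division pre-shifts the numerator to keep precision.
inline fixed fdiv(fixed a, fixed b) {
    return (fixed)(((int64_t)a << FRAC_BITS) / b);
}
```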
Fixed-point was all the rage back in the days of Doom, but I'm not sure where it stands in 2010. Considering how much silicon is pumped into floating-point performance nowadays, is there any chance that fixed-point arithmetic will still give me a significant speed boost? Does anyone have real-world experience that may apply to my situation?
As other people have said, if you're already using floating-point SIMD, I doubt you'll get much improvement with fixed point.
You said that the compiler is emitting SSE instructions, but it doesn't sound like you've tried writing your own vectorized SSE code. I don't know how good compilers usually are at that, but it's something to investigate.
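For instance, a hand-vectorized inner loop with SSE1 intrinsics might look something like this — a minimal sketch, not your actual code, assuming 16-byte-aligned arrays and a length that's a multiple of 4 (the function name is made up):

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics

// Hypothetical example: c[i] = a[i] * b[i] + c[i], four floats at a time.
// Assumes n is a multiple of 4 and all pointers are 16-byte aligned.
void madd4(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        __m128 vc = _mm_load_ps(c + i);
        vc = _mm_add_ps(_mm_mul_ps(va, vb), vc);
        _mm_store_ps(c + i, vc);
    }
}
```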
Two other areas to look at are:
Memory access - if all your computations are done in SSE, then cache misses might be taking up more time than the actual math (see the first sketch after this list).
Unrolling - you should be able to get a performance benefit from unrolling your inner loops (see the second sketch after this list). The goal is not (as many people think) to reduce the number of loop termination checks. The main benefit is to allow independent instructions to be interleaved, to hide the instruction latency. There is a presentation here entitled VMX Optimization: Taking it up a Level which might help a bit; it's focused on AltiVec instructions on the Xbox 360, but some of the unrolling advice might apply to SSE as well.
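On the memory-access point, one common fix is a structure-of-arrays layout, which keeps each field contiguous and SSE-friendly. A rough illustration, with hypothetical field names:

```cpp
// Array-of-structures: one grid cell's fields are adjacent, but a loop
// over just the pressures strides through memory and wastes cache lines.
struct CellAoS { float u, v, pressure, density; };

// Structure-of-arrays: each field is contiguous, so a loop over the
// pressures streams through memory and maps directly onto _mm_load_ps.
struct GridSoA {
    float* u;
    float* v;
    float* pressure;
    float* density;
};
```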
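And on unrolling, the idea is to keep several independent accumulators in flight so consecutive operations don't form one serial dependency chain. A minimal sketch, with the same alignment and size assumptions as above:

```cpp
#include <xmmintrin.h>

// Unrolled-by-2 sum with two independent accumulators, so the adds in
// each iteration don't have to wait on each other.
// Assumes n is a multiple of 8 and data is 16-byte aligned.
float sum8(const float* data, int n) {
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        acc0 = _mm_add_ps(acc0, _mm_load_ps(data + i));
        acc1 = _mm_add_ps(acc1, _mm_load_ps(data + i + 4));
    }
    // Combine the two accumulators and reduce horizontally.
    __m128 acc = _mm_add_ps(acc0, acc1);
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```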
As other people have mentioned, profile, profile, profile. And then let us know what's still slow :)
PS - on one of your other posts here, I convinced you to use SOR instead of Gauss-Seidel in your matrix solver. Now that I think about it, is there a reason that you're not using a tri-diagonal solver?
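In case it helps, a tri-diagonal system can be solved directly in O(n) with the Thomas algorithm. A minimal sketch (assumes the system is diagonally dominant, so no pivoting is needed):

```cpp
#include <vector>

// Thomas algorithm: solves a tri-diagonal system A x = d in O(n).
// a = sub-diagonal (a[0] unused), b = diagonal, c = super-diagonal
// (c[n-1] unused). Assumes diagonal dominance, so no pivoting.
std::vector<float> solve_tridiagonal(std::vector<float> a,
                                     std::vector<float> b,
                                     std::vector<float> c,
                                     std::vector<float> d) {
    const int n = (int)b.size();
    // Forward sweep: eliminate the sub-diagonal.
    for (int i = 1; i < n; ++i) {
        float m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    // Back substitution.
    std::vector<float> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (int i = n - 2; i >= 0; --i)
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    return x;
}
```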