Optimized FFT and mathematics for AT91SAM9 ARM processor Linux userspace program

I'm developing C/C++ software for an embedded Linux system with the AT91SAM9G20 processor from Atmel. I need to compute FFTs quickly in a Linux userspace program, using fixed-point (or perhaps floating-point) math. I understand that assembler might be the way to go for the implementation, and that an additional -mcpu switch might be required when compiling with gcc. What is the best way to proceed with this implementation, and are there any good book references or optimized FOSS libraries available?

I have to implement some algorithms that require small FFT lengths (e.g. 1024 points) to be applied a number of times, and I wonder whether some libraries (such as kissfft) would work just as well. I'm also interested in long FFT lengths, so FFTW, as suggested in an answer below, would work well too.

As a related aside, I am also wondering how integer division is handled in an ARM9 Linux userspace program. If I divide two integers (such as 25 / 4), is the division done using soft floating-point numbers? I also need to implement some heavy number-crunching algorithms, and I am wondering whether fixed-point math is better to use here than floating-point math, and how gcc really handles things.
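To make the division question concrete, here is a minimal C sketch. In C, 25 / 4 with two int operands is integer division and yields 6; no floating point is involved. On cores without a hardware divide instruction, such as the ARM926EJ-S in the AT91SAM9G20, gcc typically emits a call to an integer helper routine (__aeabi_idiv under the ARM EABI) rather than going through soft-float. The q16_* names and the Q16.16 format below are hypothetical, chosen only for illustration, and the helpers do no overflow checking.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical Q16.16 fixed-point type: 16 integer bits, 16 fraction bits. */
typedef int32_t q16_t;
#define Q16_ONE 65536

/* Plain integer division: truncates toward zero (C99), no floating point. */
static int int_div(int num, int den)
{
    assert(den != 0);
    return num / den;
}

/* Convert an integer to Q16.16 (assumes |x| < 32768, no overflow check). */
static q16_t q16_from_int(int32_t x)
{
    return (q16_t)(x * Q16_ONE);
}

/* Multiply two Q16.16 values using a 64-bit intermediate product. */
static q16_t q16_mul(q16_t a, q16_t b)
{
    return (q16_t)(((int64_t)a * b) / Q16_ONE);
}

/* Divide two Q16.16 values; the numerator is pre-scaled to keep the fraction. */
static q16_t q16_div(q16_t a, q16_t b)
{
    assert(b != 0);
    return (q16_t)(((int64_t)a * Q16_ONE) / b);
}
```

With these helpers, int_div(25, 4) is 6, while q16_div(q16_from_int(25), q16_from_int(4)) is 409600, which is 6.25 scaled by 65536. Since every division here goes through a software routine on this class of core, it is usually worth replacing divisions by constant powers of two with shifts or precomputed reciprocals in inner loops.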

Solution

FFTW contains CPU-specific optimizations, and it can also tune itself to the CPU at compile time and at run time.

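Version 3.3.1 introduces support for the ARM NEON extensions. Note, however, that the AT91SAM9G20 is built around an ARM926EJ-S core, which has neither NEON nor a hardware FPU, so the NEON code paths will not help on this particular chip.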

http://www.fftw.org/#features

And from the FAQ: Question 4.2. Why is FFTW so fast?

This is a complex question, and there is no simple answer. In fact, the authors do not fully know the answer, either. In addition to many small performance hacks throughout FFTW, there are three general reasons for FFTW's speed.

FFTW uses a variety of FFT algorithms and implementation styles that can be arbitrarily composed to adapt itself to a machine. See Q4.1 "How does FFTW work?".

FFTW uses a code generator to produce highly-optimized routines for computing small transforms.

FFTW uses explicit divide-and-conquer to take advantage of the memory hierarchy.
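As a rough illustration of the divide-and-conquer idea combined with the fixed-point arithmetic asked about in the question, here is a minimal sketch of a recursive radix-2 decimation-in-time FFT in Q15 format. This is not FFTW's code or API: the names cq15_t, q15_mul, fft_q15 and fft8_q15 are hypothetical, the size is fixed at 8 points with a hard-coded twiddle table to keep the example short, and each stage divides by 2 so the result is the DFT scaled by 1/8. It assumes the complex magnitude of every input sample is below 1.0 and does no saturation.

```c
#include <assert.h>
#include <stdint.h>

#define FFT_MAX_N 8

/* Complex number in Q15: value = re / 32768 + i * im / 32768. */
typedef struct { int16_t re, im; } cq15_t;

/* Twiddle factors W_8^k = exp(-2*pi*i*k/8) for k = 0..3, in Q15. */
static const cq15_t TWIDDLE8[4] = {
    { 32767, 0 }, { 23170, -23170 }, { 0, -32767 }, { -23170, -23170 }
};

/* Q15 multiply with rounding. Assumes >> on a negative value is an
 * arithmetic shift, which holds for gcc. */
static int32_t q15_mul(int32_t a, int32_t b)
{
    return (a * b + (1 << 14)) >> 15;
}

/* Recursive radix-2 FFT: transform n samples taken from in[0], in[stride],
 * in[2*stride], ... into out[0..n-1]. n must be a power of two <= 8. */
static void fft_q15(const cq15_t *in, int stride, cq15_t *out, int n)
{
    assert(n >= 1 && n <= FFT_MAX_N && (n & (n - 1)) == 0);
    if (n == 1) {
        out[0] = in[0];
        return;
    }
    int half = n / 2;
    fft_q15(in, 2 * stride, out, half);                 /* even samples */
    fft_q15(in + stride, 2 * stride, out + half, half); /* odd samples  */

    int tw_step = FFT_MAX_N / n; /* W_n^k equals W_8^(k * 8 / n) */
    for (int k = 0; k < half; k++) {
        cq15_t w = TWIDDLE8[k * tw_step];
        cq15_t e = out[k];
        cq15_t o = out[k + half];
        /* t = w * o (complex multiply in Q15) */
        int32_t tr = q15_mul(w.re, o.re) - q15_mul(w.im, o.im);
        int32_t ti = q15_mul(w.re, o.im) + q15_mul(w.im, o.re);
        /* Butterfly, scaled by 1/2 per stage to avoid overflow. */
        out[k].re        = (int16_t)((e.re + tr) >> 1);
        out[k].im        = (int16_t)((e.im + ti) >> 1);
        out[k + half].re = (int16_t)((e.re - tr) >> 1);
        out[k + half].im = (int16_t)((e.im - ti) >> 1);
    }
}

/* Convenience wrapper for the fixed 8-point case. */
static void fft8_q15(const cq15_t in[8], cq15_t out[8])
{
    fft_q15(in, 1, out, 8);
}
```

For an impulse input (in[0] = {16384, 0}, all other samples zero), every output bin comes out as {2048, 0}, which is the input value divided by 8. A real implementation for 1024 points would use a larger precomputed twiddle table and probably an iterative, in-place form; a tuned library such as FFTW or kissfft is likely to be the better starting point.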