I'm not looking for a portable SIMD implementation.
All I need is a bit-accurate implementation. Performance doesn't matter much, as long as it isn't extremely slow.
I want to use it for early-stage development and testing, so that I can compile and run on a host computer for the first 10+ iterations, then cross-compile and fine-tune performance on the ARM target.
I'm used to this development cycle from working with TI DSPs, as described here. I want to carry it over when I move to ARM NEON.
Has this already been done, or do I need to reinvent the wheel?
Intel has a useful set of macros, neon2sse.h, which translates NEON intrinsics to SSE. This lets you build and test C/C++ code that uses NEON intrinsics on an x86 platform.
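As a rough sketch of how this might fit the host-first workflow from the question, the same kernel can pick its header by target. The names here are assumptions, not part of Intel's package: `USE_NEON_2_SSE` is a hypothetical flag you would pass yourself on the x86 host, `sat_add_s16x8` is an illustrative helper, and the file name `NEON_2_SSE.h` is the one I believe Intel's current distribution uses, so adjust it to match your copy. A plain C fallback with the same saturation rule is included as a reference to compare against.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>          /* real NEON intrinsics on the ARM target */
#define HAVE_NEON_API 1
#elif defined(USE_NEON_2_SSE)  /* hypothetical flag: build the host with -DUSE_NEON_2_SSE */
#include "NEON_2_SSE.h"        /* assumed file name of Intel's translation header */
#define HAVE_NEON_API 1
#endif

/* Saturating add of eight int16 lanes: out[i] = clamp(a[i] + b[i]). */
static void sat_add_s16x8(const int16_t *a, const int16_t *b, int16_t *out)
{
#ifdef HAVE_NEON_API
    /* Identical source on ARM and on x86 with the translation header. */
    int16x8_t va = vld1q_s16(a);
    int16x8_t vb = vld1q_s16(b);
    vst1q_s16(out, vqaddq_s16(va, vb));
#else
    /* Plain C reference: widen, add, clamp to the int16 range. */
    for (int i = 0; i < 8; ++i) {
        int32_t s = (int32_t)a[i] + (int32_t)b[i];
        if (s > INT16_MAX) s = INT16_MAX;
        if (s < INT16_MIN) s = INT16_MIN;
        out[i] = (int16_t)s;
    }
#endif
}
```

Running the same test vectors through the intrinsic path and the plain C path is a cheap way to check that the host build really is bit-exact for the intrinsics you rely on.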