To use the vector units, e.g. the 512-bit wide units for simultaneous operation on 8 double-precision values, is it necessary to be single-threaded and use AVX intrinsics? If my program is not easy to vectorize, could I maybe get some of the benefit by launching 8 threads, where each uses 1 of the units?
Multi-threading and SIMD are orthogonal: if your problem has large-scale parallelism, you can multi-thread. If it has SIMD-friendly parallelism, you can vectorize. Often you can do both, which is the whole point of Xeon Phi.
Every CPU core in a multi-core CPU has its own set of vector execution units.
For problems limited by memory bandwidth, using SIMD in each thread can mean you saturate memory bandwidth with only a couple of threads instead of many. But each core has its own private L1/L2 cache (e.g. 256 KiB of L2 in Intel SnB-family cores), so if you can cache-block (aka loop-tile) appropriately, each thread can loop over a small chunk of your working set that stays hot in that core's local cache.
For problems that don't vectorize, yes, it can certainly help to multi-thread. Each core is pretty much independent, though, so avoiding SIMD doesn't make your individual threads run any faster.
This idea is mostly bogus:

> could i maybe get some of the benefit by launching 8 threads where each use 1 of the units
It's not totally bogus, though: Hyperthreading does work better when the two threads sharing the same physical core are bottlenecked on something like memory latency or branch mispredicts (rather than ALU execution ports, cache size, or memory bandwidth).
For more low-level stuff, see Agner Fog's optimization guides, and other links in the x86 tag wiki.
Redesigning your data structures to be SIMD-friendly is often possible, but may require large changes. Hopefully you used some wrappers to abstract access to your data structures, so you can change their layout without touching huge amounts of code.
For an example of redesigning code to be SIMD friendly, see the slides from a SIMD talk.