erlang multicore parallel-processing performance

Reasons for sub-linear speedup in parallel programs

What are the reasons a parallelized program doesn't achieve the ideal speedup?

For example, I have thought of data dependencies, the cost of data transfer between threads (or actors), and synchronisation for access to shared data structures. Are there any other causes (or subcategories of the ones I mentioned)?

I'm particularly interested in problems occurring in the Erlang actor model, but any other issues are welcome.

Solution

A few in no particular order:

Cache line sharing (false sharing) - multiple variables on the same cache line can incur coherence overhead between processors, even if the theoretical model says they should be independent.
Context switch overhead - if you have more threads than cores, there will be overhead in context switching.
Kernel scalability issues - a kernel may perform well at, say, 4 cores but be less efficient at 8.
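Lock convoying - threads of equal priority repeatedly contend for the same lock; each failed acquisition costs a context switch, so the threads queue up behind the lock holder and make slow progress.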
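Amdahl's law - the speedup is bounded by the part of the program that cannot be parallelized. If a fraction p of the run time can be parallelized, the speedup on N cores is at most 1 / ((1 - p) + p / N), which never exceeds 1 / (1 - p).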
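
To make the Amdahl's law point concrete, here is a minimal sketch in Python (not Erlang, purely for illustration). The function names `amdahl_speedup` and `amdahl_limit` are made up for this example, and the 90% parallel fraction is an assumed figure, not a measurement from any real program.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when `parallel_fraction` of the run time
    is spread perfectly across `workers` cores (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def amdahl_limit(parallel_fraction: float) -> float:
    """Speedup bound as the number of cores grows without limit."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if parallel_fraction == 1.0:
        return float("inf")
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    p = 0.9  # assumed: 90% of the run time can be parallelized
    for n in (1, 2, 4, 8, 16, 64):
        print(f"{n:3d} cores: at most {amdahl_speedup(p, n):.2f}x")
    print(f"any number of cores: at most {amdahl_limit(p):.2f}x")
```

With these assumed numbers, 8 cores give at most about 4.7x rather than 8x, and no number of cores can exceed 10x. The other items in the list (cache effects, context switches, lock contention) push the real figure below even this bound.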