It seems in vogue to predict that superscalar out-of-order CPUs are going the way of the dodo and will be replaced by huge numbers of simple, scalar, in-order cores. This doesn't seem to be happening in practice because, even if the problem of parallelizing software were solved tomorrow, there would still be tons of legacy software out there. Besides, parallelizing software is not a trivial problem.
I understand that GPGPU is a hybrid model, where the CPU is designed for single-thread performance and the graphics card for parallelism, but it's an ugly one. The programmer needs to explicitly rewrite code to run on the graphics card, and to the best of my understanding expressing parallelism efficiently for a graphics card is much harder than expressing it efficiently for a multicore general-purpose CPU.
What's wrong with a hybrid model where every PC comes with one or two "expensive" superscalar out-of-order cores and 32 or 64 "cheap" cores, but with the same instruction set as the expensive cores and possibly on the same piece of silicon? The operating system would be aware of this asymmetry and would fill the out-of-order cores first, with the highest-priority threads. This prioritization might even be explicitly exposed to the programmer via the OS API, but the programmer wouldn't be forced to care about the distinction unless he/she wanted to control the details of the scheduling.
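To make the idea concrete, here is a purely illustrative sketch (not a proposal for an actual API) of how a programmer who did want control might hint the scheduler, using today's Linux affinity calls as a stand-in and assuming the hypothetical chip exposed one of its "expensive" cores as logical CPU 0. A real asymmetry-aware OS API would presumably be higher level than a raw affinity mask, and by default the OS would make this decision for you.

```c
/* Illustrative only: pin a latency-sensitive thread to the hypothetical
 * "big" out-of-order core, assumed here to be logical CPU 0. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void *latency_critical_work(void *arg)
{
    (void)arg;
    /* ... code that depends on single-thread performance ... */
    return NULL;
}

int main(void)
{
    pthread_t t;
    cpu_set_t big_core;

    CPU_ZERO(&big_core);
    CPU_SET(0, &big_core);   /* assumption: CPU 0 is the out-of-order core */

    pthread_create(&t, NULL, latency_critical_work, NULL);
    /* Ask the OS to keep this thread on the fast core. */
    pthread_setaffinity_np(t, sizeof(big_core), &big_core);
    pthread_join(t, NULL);
    return 0;
}
```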
Edit: If the vote to close is because this supposedly isn't programming related, here's a rebuttal: I think it is programming-related because I want to hear programmers' perspective on why such a model is a good or bad idea and whether they would want to program to it.
W00t, what a great question =D
At a glance, I can see two problems. Note that from now on I'll be considering a CPU-bound parallel application when making my arguments.
The first one is the control overhead imposed on the operating system. Remember, the OS is responsible for dispatching processes to the CPUs they will run on, and it has to control concurrent access to the data structures that hold this scheduling information. So there is your first bottleneck: the OS abstracting the scheduling of tasks across all those cores. That is already a drawback.
Here is a nice experiment. Write an application that makes heavy use of the CPU. Then, with some other tool, like atsar, collect statistics on user and system time. Now vary the number of concurrent threads and watch what happens to the system time. Plotting the data may help you see how this (not so =) useless processing grows.
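A minimal sketch of such an application, assuming pthreads on Linux; the iteration count, the busy-loop body and the program name are arbitrary placeholders:

```c
/* Spawn N CPU-bound threads so the scheduling overhead can be observed
 * from outside with time(1), atsar or similar tools. */
#include <pthread.h>
#include <stdlib.h>

#define ITERATIONS 200000000UL

static void *burn_cpu(void *arg)
{
    (void)arg;
    volatile unsigned long sum = 0;        /* volatile: keep the loop alive */
    for (unsigned long i = 0; i < ITERATIONS; i++)
        sum += i;                          /* pure CPU work, no I/O */
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = (argc > 1) ? atoi(argv[1]) : 4;
    pthread_t *threads = malloc(sizeof(*threads) * nthreads);

    for (int i = 0; i < nthreads; i++)
        pthread_create(&threads[i], NULL, burn_cpu, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);

    free(threads);
    return 0;
}
```

Compile with `gcc -O2 -pthread burn.c -o burn`, then compare `time ./burn 4` against `time ./burn 64` (or watch it with atsar while it runs); the expectation is that the system-time share grows as the thread count climbs past the core count.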
Second, as you add cores to your system, you also need a more powerful bus. CPU cores need to exchange data with memory so computations can be done, so with more cores you'll have more concurrent access to the bus. Someone may argue that a system with more than one bus could be designed, and indeed it could. However, extra mechanisms must then be in place to keep the data used by the cores consistent. Such mechanisms do exist at the cache level, but they are very expensive to deploy at the main-memory level.
Keep in mind that every time a thread changes some data in memory, the change must be propagated to the other threads when they access that data, something that happens constantly in parallel applications (mainly in numerical ones).
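If you want to see the cost of that propagation on current hardware, here is a rough sketch (assuming pthreads and 64-byte cache lines; the counts and names are arbitrary). Two threads writing counters that sit in the same cache line force that line to bounce between cores on every write, which is typically much slower than writing counters padded onto separate lines:

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000UL

/* Two counters in the same cache line (contended) versus two counters
 * forced onto separate lines; 64-byte lines are an assumption. */
struct same_line   { volatile unsigned long a, b; };
struct split_lines {
    volatile unsigned long a;
    char pad[64];
    volatile unsigned long b;
};

static struct same_line   contended;
static struct split_lines padded;

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void *bump_contended(void *arg)
{
    (void)arg;
    for (unsigned long i = 0; i < ITERATIONS; i++)
        contended.b++;                 /* same line as contended.a */
    return NULL;
}

static void *bump_padded(void *arg)
{
    (void)arg;
    for (unsigned long i = 0; i < ITERATIONS; i++)
        padded.b++;                    /* different line from padded.a */
    return NULL;
}

int main(void)
{
    pthread_t t;
    double start;

    start = now_sec();
    pthread_create(&t, NULL, bump_contended, NULL);
    for (unsigned long i = 0; i < ITERATIONS; i++)
        contended.a++;
    pthread_join(t, NULL);
    printf("same cache line: %.2f s\n", now_sec() - start);

    start = now_sec();
    pthread_create(&t, NULL, bump_padded, NULL);
    for (unsigned long i = 0; i < ITERATIONS; i++)
        padded.a++;
    pthread_join(t, NULL);
    printf("separate lines:  %.2f s\n", now_sec() - start);

    return 0;
}
```

Compile with `gcc -O2 -pthread bounce.c -o bounce` and compare the two timings; on a typical multicore machine the "same cache line" case should come out noticeably slower, and that coherence traffic is exactly what gets harder to sustain as the core count grows.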
Nevertheless, I do agree with your position that the current models are ugly. And yes, it is currently much harder to express parallelism in GPGPU programming models, since the programmer is totally responsible for moving the bits around. I eagerly hope for more succinct, high-level and standardized abstractions for many-core and GPGPU application development.