Why is the performance of non-superscalar parts of a superscalar processor affected?

In the second-last paragraph of the ILP section of Wikipedia's CPU article:

In the case where a portion of the CPU is superscalar and part is not, the part which is not suffers a performance penalty due to scheduling stalls. The Intel P5 Pentium had two superscalar ALUs which could accept one instruction per clock cycle each, but its FPU could not accept one instruction per clock cycle. Thus the P5 was integer superscalar but not floating point superscalar.

What is a scheduling stall? Why does the performance of the non-superscalar part of the CPU suffer from it?

Is this saying that the scalar part is slower than it would be if the rest of the CPU were scalar too?

Solution

I hadn't heard the term "scheduling stall" before, but it sounds like it's just saying that the pipeline will bottleneck on the scalar part.

The scalar part still runs at its own maximum throughput, so I think the wording of that Wikipedia article is misleading: "the part which is not suffers a performance penalty" certainly makes it sound as if the scalar part fails to reach its own maximum throughput.

I guess this counts as a "stall" in the sense that the superscalar part of the CPU expects to issue 2 instructions per cycle but can only issue 1, because no execution resources are available for the second one.
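
To make that concrete, here is a toy issue model in Python. It is only a sketch under simplifying assumptions of my own, not the real P5 pairing rules: in-order issue, an issue width of 2, two integer units and one FP unit, each unit accepting at most one instruction per cycle, and no data dependencies or multi-cycle latencies. The function name `cycles_to_issue` and the `"int"`/`"fp"` labels are made up for the illustration.

```python
def cycles_to_issue(instrs, issue_width=2, int_units=2, fp_units=1):
    """Count cycles needed to issue `instrs` in program order.

    Each cycle, instructions are issued from the head of the stream until
    the issue width is used up or the next instruction's unit type has no
    free unit left this cycle (hypothetical toy model; assumes every unit
    count is at least 1).
    """
    cycles = 0
    i = 0
    while i < len(instrs):
        free = {"int": int_units, "fp": fp_units}  # units free this cycle
        issued = 0
        while i < len(instrs) and issued < issue_width and free[instrs[i]] > 0:
            free[instrs[i]] -= 1
            issued += 1
            i += 1
        cycles += 1  # anything not issued waits for the next cycle
    return cycles


all_int = cycles_to_issue(["int"] * 8)       # 2 per cycle
all_fp = cycles_to_issue(["fp"] * 8)         # 1 per cycle, second slot idle
mixed = cycles_to_issue(["int", "fp"] * 4)   # one of each can pair up
```

In this model the all-FP stream takes 8 cycles, exactly what a purely scalar machine with one FP unit would take, while the all-integer stream takes 4. The FP unit is never slower than it would be on its own; the "penalty" is only that the second issue slot sits idle whenever two FP instructions are back to back.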