c#.net multithreading performance hyperthreading

Why would a fully CPU bound process work better with hyperthreading?

Given:

An entirely CPU bound very large (i.e., more than a few CPU cycles) job, and
A CPU with four physical and a total of 8 logical cores,

should 8, 16, and 28 threads perform better than 4? I understand that four threads would have lesser context switches to perform and lesser overhead than 8, 16, or 28 threads would have on a four physical core machine. However, the timings are -

Threads    Time Taken (in seconds)
   4         78.82
   8         48.58
   16        51.35
   28        52.10

The code used to test get the timings is mentioned in the Original Question section below. The CPU specifications are also given at the bottom.

Original Question

What does it mean when we say

Hyper-threading works by duplicating certain sections of the processor—those that store the architectural state—but not duplicating the main execution resources. This allows a hyper-threading processor to appear as the usual "physical" processor and an extra "logical" processor to the host operating system

This question is asked on SO today, and it tests the performance of multiple threads doing the same work. It has the following code:

private static void Main(string[] args)
{
    int threadCount;
    if (args == null || args.Length < 1 || !int.TryParse(args[0], out threadCount))
        threadCount = Environment.ProcessorCount;

    int load;
    if (args == null || args.Length < 2 || !int.TryParse(args[1], out load))
        load = 1;

    Console.WriteLine("ThreadCount:{0} Load:{1}", threadCount, load);
    List<Thread> threads = new List<Thread>();
    for (int i = 0; i < threadCount; i++)
    {
        int i1 = i;
        threads.Add(new Thread(() => DoWork(i1, threadCount, load)));
    }

    var timer = Stopwatch.StartNew();
    foreach (var thread in threads) thread.Start();
    foreach (var thread in threads) thread.Join();
    timer.Stop();

    Console.WriteLine("Time:{0} seconds", timer.ElapsedMilliseconds/1000.0);
}

static void DoWork(int seed, int threadCount, int load)
{
    var mtx = new double[3,3];
    for (var i = 0; i < ((10000000 * load)/threadCount); i++)
    {
         mtx = new double[3,3];
         for (int k = 0; k < 3; k++)
            for (int l = 0; l < 3; l++)
              mtx[k, l] = Math.Sin(j + (k*3) + l + seed);
     }
}

(I have cut out a few braces to bring the code to a single page for quick readability.)

I ran this code on my machine to replicate the issue. My machine has four physical cores and eight logical ones. The method DoWork() in the code above is completely CPU bound. I felt hyper-threading could contribute to maybe a 30% speedup (because here we have as many CPU-bound threads as the physical cores (i.e., 4)). But it nearly does attain a 64% performance gain. When I ran this code for four threads, it took about 82 seconds, and when I ran this code for 8, 16, and 28 threads, it ran in all the cases in about 50 seconds.

To summarize the timings:

Threads    Time Taken (in seconds)
   4         78.82
   8         48.58
   16        51.35
   28        52.10

I could see that CPU usage was ~50% with four threads. Shouldn't it be ~100%? After all, my processor has only four physical cores. And the CPU usage was ~100% for 8 and 16 threads.

I am trying to understand Why would a fully CPU-bound process work better with hyperthreading?.

For the sake of completion,

I have Intel Core i7-4770 CPU @ 3.40 GHz, 3401 MHz, 4 Core(s), 8 Logical Processor(s).
I ran the code in Release mode.
I know that the way timings are measured is bad. This will only give time for the slowest thread. I took the code as it is from the other question.

Solution

I could see that CPU usage was ~50% with 4 threads. Shouldn't it be ~100%?

No, it shouldn't.

what is the justification for 50% CPU usage when running 4 CPU bound threads on a 4 physical core machine?

This is simply how CPU utilization is reported in Windows (and on at least some other OS's too, by the way). A HT CPU shows up as two cores to the operating system, and is reported as such.

Thus, Windows sees an eight-core machine, when you have four HT CPUs. You'll see eight different CPU graphs if you look at the "Performance" tab in Task Manager, and the total CPU utilization is computed with 100% utilization being the full utilization of these eight cores.

If you are only using four threads, then these threads cannot fully utilize the available CPU resources and that explains the timings. They can, at most, use four of the eight cores available and so of course your utilization will max out at 50%. Once you go past the number of logical cores (8), runtime increases again; you are adding scheduling overhead without adding any new computational resources in that case.

By the way…

HyperThreading has improved quite a lot from the old days of shared cache and other limitations, but it will still never provide the same throughput benefit that a full CPU could, as there remains some contention within the CPU. So even ignoring OS overhead, your 35% improvement in speed seems pretty good to me. I often see no more than a 20% speed-up adding the extra HT cores to a computationally-bottlenecked process.