Tags: windows, multithreading, mfc, disk

Improving image storage/loading performance on an HDD


Question

Hello, I have a question about thread pools and simultaneous HDD reads/writes. This is my first time posting a question, so I apologize in advance for the length...

On one PC, an image processing/storage program and an image loading program are running.

When image storage and image loading run simultaneously on the same HDD, the image processing operation seems to slow down.

An HDD has only one head assembly, so I know it is fastest to do one operation at a time... There is nothing I can do about that, so I want to minimize the slowdown.

Here are the development environment and implementation details.

I am working with MFC + OpenCV on Windows 10.0.19044.

The image processing program runs 24 hours a day, repeating whenever an instruction is received. Each job involves two images of 16384 × 40000 pixels at 1 byte per pixel. Because the images are so large, both the image processing and the image storage (done after splitting each image into regions) run on a thread pool.

The image loading program runs only when the user needs it. On a query, it looks up the image information in the DB and retrieves the images from the HDD.

The PC is equipped with an SSD and two 13 TB HDDs. The processor is an i9-12900KF (16 cores, 24 threads).

Jobs are pulled from a queue, and both image processing and image storage jobs are processed by the same thread pool.

Since both share the same thread pool, I suspect that while image storage is running, fewer threads are available for image processing.

I set the number of threads to 40 in both programs, for no particular reason. I have heard that the count should be tuned to the number of cores, and I am still considering that.

I store each image in both PNG and JPG format.

By default, image loading reads the small JPG file; the functions are split so that the user can load the PNG directly when needed.

When saving the split images, the image encoding work runs concurrently in the thread pool, while the memory → HDD writes are performed sequentially, one at a time, on a single thread.

For image loading, the HDD → memory reads are done sequentially, one at a time, while the image decoding work runs concurrently in the thread pool.
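
In rough outline, the save path looks like this (a simplified sketch of the pattern just described; the struct and function names are illustrative only, and the real code encodes with OpenCV's cv::imencode):

```cpp
// Workers encode tiles in parallel; a single writer thread drains a queue
// and performs all HDD writes sequentially, so the disk only ever services
// one write at a time.
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <vector>

struct EncodedTile {
    std::string path;                  // destination file on the HDD
    std::vector<unsigned char> bytes;  // PNG/JPG buffer, e.g. from cv::imencode
};

std::queue<EncodedTile> writeQueue;
std::mutex queueMutex;
std::condition_variable queueCv;
bool done = false;  // set to true under the lock + notify_all() to shut down

// Pool workers call this after encoding a tile.
void enqueueTile(EncodedTile tile) {
    {
        std::lock_guard<std::mutex> lock(queueMutex);
        writeQueue.push(std::move(tile));
    }
    queueCv.notify_one();
}

// The single writer thread: all memory -> HDD traffic goes through here.
void writerLoop() {
    for (;;) {
        std::unique_lock<std::mutex> lock(queueMutex);
        queueCv.wait(lock, [] { return !writeQueue.empty() || done; });
        if (writeQueue.empty() && done) break;
        EncodedTile tile = std::move(writeQueue.front());
        writeQueue.pop();
        lock.unlock();  // release the lock during the slow disk write

        std::ofstream out(tile.path, std::ios::binary);
        out.write(reinterpret_cast<const char*>(tile.bytes.data()),
                  static_cast<std::streamsize>(tile.bytes.size()));
    }
}
```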

The image processing result must be stored in the DB, and that result needs to be delivered quickly.

It does not matter if image storage slows down. Slower image loading is not satisfying for the user either, but that can be tolerated to some extent. (Still, I want to return results as quickly as possible...)

So here is what I considered:

  1. If the image storage/loading threads lower their thread priority, will the image processing threads get more CPU time and make more progress? (See the sketch after this list.)
  2. Is it meaningful to split the work into separate thread pools for image storage and image processing, instead of one shared pool?
  3. Why not save the image to the SSD first, and have a separate service program slowly move it to the HDD?
  4. Or is the disk actually not the problem at all?
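
For option 1, I imagine something like the following (a minimal Win32 sketch; I have not confirmed this is the right approach):

```cpp
// A storage/loading worker lowers its own CPU priority, and optionally
// enters Windows "background mode", which also lowers its I/O priority so
// it yields the disk to foreground work.
#include <windows.h>

void enterLowPriorityMode() {
    // Lower the CPU scheduling priority of this thread.
    ::SetThreadPriority(::GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);

    // THREAD_MODE_BACKGROUND_BEGIN additionally lowers the I/O and memory
    // priority of the calling thread (it may only be applied to the current
    // thread). Pair it with THREAD_MODE_BACKGROUND_END when finished.
    ::SetThreadPriority(::GetCurrentThread(), THREAD_MODE_BACKGROUND_BEGIN);
}
```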

Options 1 and 2 will be developed and released. (It is difficult to reproduce the problem in the office...)

For the third method (write to the SSD first, then copy to the HDD in one pass), the copy would still overlap with the HDD reads, and I think it mostly just complicates the development. However, the SSD is significantly faster than the HDD when storing images.
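
If I went that way, I imagine the service program would look roughly like this (a minimal C++17 sketch; the staging and archive paths are made up for the example, and both directories are assumed to exist):

```cpp
// A low-priority service thread drains a staging directory on the SSD and
// moves finished files to the HDD, one at a time.
#include <chrono>
#include <filesystem>
#include <thread>

namespace fs = std::filesystem;

void stagingCopyLoop() {
    const fs::path staging = "D:/staging";  // SSD (assumed path)
    const fs::path archive = "E:/images";   // HDD (assumed path)
    for (;;) {
        for (const auto& entry : fs::directory_iterator(staging)) {
            if (!entry.is_regular_file()) continue;
            // Copy first, then delete, so a crash never loses the file.
            fs::copy_file(entry.path(), archive / entry.path().filename(),
                          fs::copy_options::overwrite_existing);
            fs::remove(entry.path());
        }
        std::this_thread::sleep_for(std::chrono::seconds(5));
    }
}
```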

As for number 4: loading the JPGs is not slow, since the files are small... it is the decoding step that is slow, and I assumed decoding has nothing to do with the HDD.

So, while both programs originally had 40 threads in their pools, I reduced the image loading program's thread count to two and shipped an update, but the report came back that only image loading got slower and the issue remained.

The situation is complicated and there are many suspects, but I am asking because I think there are things I do not know or have gotten wrong...


Solution

  • First of all, you use a thread pool with far more threads than the number of cores of the i9-12900KF processor. Having two threads run on the same physical core generally makes both slower. If they run on the same logical core, they cannot run simultaneously at all (they will constantly interrupt each other). In fact, even threads on different physical cores can slow each other down significantly when one makes intensive use of the L3 cache or of memory, which is likely your case: operating on a large buffer can cause cache lines belonging to other cores to be evicted and reloaded later. This is known as cache thrashing, and it can become critical with non-contiguous loads/stores.
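
    As a concrete starting point, the pool size can be derived from the hardware instead of a fixed 40. A minimal sketch (the headroom of two threads is an arbitrary choice, not a rule):

```cpp
#include <thread>

// Size CPU-bound worker pools from the actual hardware, keeping a little
// headroom for the dedicated I/O thread and the rest of the system.
unsigned cpuWorkerCount() {
    unsigned hw = std::thread::hardware_concurrency();  // 24 on an i9-12900KF
    if (hw == 0) hw = 8;          // fallback when the value cannot be queried
    return hw > 2 ? hw - 2 : 1;   // leave headroom; never return 0
}
```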

    The target processor is a hybrid (big.LITTLE-style) design, so scheduling threads on it is more complex than usual. Many libraries do not handle such architectures well yet (they do not run efficiently on them), and even the OS schedulers are barely tuned for them (at least on Windows and Linux). The number of threads per core also differs: a performance core can execute 2 threads simultaneously (sharing its resources), while an efficiency core can execute only 1 thread at a time. The frequencies differ too: 2.4 GHz vs 3.2 GHz base frequency, and 3.9 GHz vs 5.1 GHz turbo frequency, for the efficiency and performance cores respectively. Depending on which cores your threads are scheduled on, their performance can therefore vary.

    The frequency of the cores running your threads depends on how many cores are active and on the work each core is doing. For example, running computationally intensive code that uses the FP AVX2 units (or the not-officially-supported AVX-512 units) on one core can significantly reduce the frequency of the other cores. The more cores are active, the lower the frequency. This dynamic frequency scaling hurts the scalability of applications, but it is what lets the processor stay within its power budget (and not melt, too).

    Caching also matters a lot. Mainstream OSes tend to keep data read from or written to the HDD in memory so that subsequent accesses are faster. This cache occupies memory that is not counted as allocated, so when a process requests a large amount of memory, the OS flushes/invalidates part of the I/O cache to make room, and later accesses must then reload the data from the storage device (much slower). The workaround is to check how much physical memory is genuinely available (not merely uncommitted) and to avoid consuming so much of it that the storage cache gets evicted.
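
    A rough sketch of such a check on Windows follows; note that GlobalMemoryStatusEx's ullAvailPhys counts standby-cache pages as available, so treat it as a first approximation only:

```cpp
#include <windows.h>

// Query how much physical memory is currently available before allocating
// large image buffers, so the OS file cache is not evicted wholesale.
ULONGLONG availablePhysicalBytes() {
    MEMORYSTATUSEX status{};
    status.dwLength = sizeof(status);
    ::GlobalMemoryStatusEx(&status);
    return status.ullAvailPhys;  // includes standby (cached) pages
}
```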

    Having two threads doing I/O is generally not faster than one thread on an HDD (especially with a single head assembly). Some OS storage stacks use locks, sometimes even one giant lock. Because of that, one loading thread issuing asynchronous I/O can be faster than blocking I/O on one or several threads: with multiple requests outstanding, the OS can reorder them so they are served more contiguously (reducing seek time by picking up data along the way).
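
    For illustration, here is a minimal sketch of a single loading thread issuing a Win32 overlapped read (error handling stripped to the bare minimum; the benefit comes when several such requests are left outstanding so the OS can reorder them):

```cpp
#include <windows.h>
#include <vector>

// Read a whole file with overlapped (asynchronous) I/O. Returns an empty
// vector on failure. Assumes files smaller than 4 GB (DWORD-sized reads).
std::vector<char> readFileOverlapped(const wchar_t* path) {
    HANDLE file = ::CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                                OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
    if (file == INVALID_HANDLE_VALUE) return {};

    LARGE_INTEGER size{};
    ::GetFileSizeEx(file, &size);
    std::vector<char> buffer(static_cast<size_t>(size.QuadPart));

    OVERLAPPED ov{};
    ov.hEvent = ::CreateEventW(nullptr, TRUE, FALSE, nullptr);

    // The read is issued asynchronously; a real loader could issue several
    // reads here before waiting, which is the whole point of overlapped I/O.
    ::ReadFile(file, buffer.data(), static_cast<DWORD>(buffer.size()),
               nullptr, &ov);

    DWORD bytesRead = 0;
    ::GetOverlappedResult(file, &ov, &bytesRead, TRUE);  // TRUE = wait
    buffer.resize(bytesRead);

    ::CloseHandle(ov.hEvent);
    ::CloseHandle(file);
    return buffer;
}
```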