I am developing a Python program where speed is critical, and one of its core components is reading JPEG natural images, with resolutions between 128 x 128 and 512 x 512, from disk. Currently, I use jpeg4py, a libjpeg-turbo-based library that is, in my experience, faster than its competitors (e.g., PIL or OpenCV). However, according to a few preliminary experiments I conducted, saving those images' array representations as raw binary data speeds up reading by roughly 25% (the dataset's storage size grows by nearly an order of magnitude, but that is not an issue for my application).
Previously, I assumed the overhead introduced by JPEG's decoding scheme is made up for by the fact that the files being loaded are smaller and thus fewer bytes are read, but it appears I was wrong? Perhaps my findings would vary on different hardware? In that case, what would the maximal gap between the two methods be, e.g., is it circa 30%, or are more dramatic figures, such as 200%, possible?
Please note that the chief metric important to my usage is reading speed, and other considerations are secondary. Also, if there are faster alternatives for achieving my goal, I would very much appreciate it if you could mention them.
Thank you.
Perhaps my findings would vary on different hardware?
Yes. This is a common problem. There is a trade-off between decompression speed and the time spent reading from the storage device, and the overall speed depends on the processor, the performance of the JPEG library, and the speed of the storage device. In fact, it is even more complex, since the compression ratio can in theory affect both the decompression speed and the file size. Additionally, the operating system can keep the file in an in-memory cache, in which case subsequent reads and writes are very cheap.
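To make the trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes a purely additive model (read time plus decode time, with no overlap, no seek latency, and no OS cache), a compression ratio of about 10 (taken from the "order of magnitude" in the question), and made-up throughput figures for the decoder and the devices. The function and constant names are hypothetical and the numbers are placeholders, not measurements.

```python
def estimated_load_time(raw_bytes, stored_bytes, storage_bw, decode_bw=None):
    """Estimated seconds to get one image into memory.

    raw_bytes    -- size of the decoded pixel array, in bytes
    stored_bytes -- size of the file on disk, in bytes
    storage_bw   -- sustained read throughput of the device, in bytes/s
    decode_bw    -- decoder output throughput in bytes/s (None: no decoding)
    """
    read_time = stored_bytes / storage_bw
    decode_time = 0.0 if decode_bw is None else raw_bytes / decode_bw
    return read_time + decode_time


RAW_BYTES = 512 * 512 * 3      # one 512 x 512 RGB image, 8 bits per channel
JPEG_BYTES = RAW_BYTES / 10    # assumed ~10x compression ratio
DECODE_BW = 300e6              # assumed decoder output rate (bytes/s)

# Assumed sustained read rates in bytes/s (illustrative placeholders only).
DEVICES = {"hdd": 150e6, "sata_ssd": 500e6, "nvme_ssd": 3000e6}


def compare(devices=DEVICES):
    """Return, per device, which format the model predicts to load faster."""
    winners = {}
    for name, bandwidth in devices.items():
        t_raw = estimated_load_time(RAW_BYTES, RAW_BYTES, bandwidth)
        t_jpeg = estimated_load_time(RAW_BYTES, JPEG_BYTES, bandwidth, DECODE_BW)
        winners[name] = "raw" if t_raw < t_jpeg else "jpeg"
    return winners
```

With these assumed figures the model favors JPEG on the HDD and raw data on both SSDs. The gap between the two methods is not bounded by anything like 30%: it depends on the ratio between storage throughput and decoder throughput, so changing either number can shift the result a lot in either direction.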
One solution to this problem is to adapt the compression ratio to the target storage device. NVMe SSDs are often very fast, so it is a good idea not to enable compression on such devices unless space is a problem. SATA SSDs are slower but still quite fast, so the compression method needs to be efficient. HDDs are generally slow, so data often needs to be compressed for reads and writes to be faster (note that some compression methods are still so slow that they decrease performance: this is generally the case for the PNG file format, which typically uses the deflate compression algorithm). Note that the processor and memory should in theory be taken into account as well, but it is hard to estimate their performance without running a benchmark on the machine (some libraries use a benchmark-based auto-tuning approach to improve performance).
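A benchmark-based choice could look roughly like the sketch below. It is only an illustration: zlib from the standard library stands in for the JPEG decoder so that the example is self-contained, and all function names are hypothetical. In practice you would plug in your actual loaders (e.g. the jpeg4py one) and run them on a sample of the real dataset on the target machine.

```python
import os
import tempfile
import time
import zlib


def read_raw(path):
    """Load an uncompressed file as bytes."""
    with open(path, "rb") as f:
        return f.read()


def read_compressed(path):
    """Load a zlib-compressed file and decompress it (stand-in for JPEG)."""
    with open(path, "rb") as f:
        return zlib.decompress(f.read())


def median_time(loader, path, repeats=5):
    """Median wall-clock time of loader(path) over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        loader(path)
        timings.append(time.perf_counter() - start)
    timings.sort()
    return timings[len(timings) // 2]


def pick_fastest(candidates, repeats=5):
    """candidates maps a label to a (loader, path) pair.

    Returns the label with the lowest median time and all the timings.
    """
    timings = {
        label: median_time(loader, path, repeats)
        for label, (loader, path) in candidates.items()
    }
    return min(timings, key=timings.get), timings


def make_sample_files(directory, payload):
    """Write the same payload once raw and once zlib-compressed."""
    raw_path = os.path.join(directory, "sample.raw")
    packed_path = os.path.join(directory, "sample.z")
    with open(raw_path, "wb") as f:
        f.write(payload)
    with open(packed_path, "wb") as f:
        f.write(zlib.compress(payload))
    return raw_path, packed_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        payload = os.urandom(512 * 512 * 3)  # random bytes, not a real image
        raw_path, packed_path = make_sample_files(tmp, payload)
        best, timings = pick_fastest({
            "raw": (read_raw, raw_path),
            "zlib": (read_compressed, packed_path),
        })
        print(best, timings)
```

Keep in mind that files written just before being read are very likely served from the operating system's in-memory cache, so this sketch measures warm reads. To compare storage devices fairly, the benchmark should read data that is not cached (for example, a dataset larger than RAM).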