Tags: performance, architecture, compression, multicore, disk

Given that disks are slow and we have multiple cores, does on-the-fly decompression make sense for performance?


It used to be that disk compression was used to increase storage space at the expense of CPU time, but we were all on single-processor systems back then.

These days there are spare cores available to do the decompression work in parallel with processing the data.

For I/O-bound applications (particularly read-heavy, sequential data processing), it might be possible to increase throughput by reading and writing only compressed data to disk.
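To make the idea concrete, here is a minimal Python sketch of the pipeline I have in mind: one thread reads and decompresses while another thread processes the chunks. (In CPython, zlib releases the GIL during decompression, so the two threads genuinely overlap.) The file path and chunk handler are placeholders.

```python
import gzip
import queue
import threading

def read_compressed(path, q, chunk_size=64 * 1024):
    """Producer: read + decompress on one core, feeding the queue."""
    with gzip.open(path, "rb") as f:  # decompression happens here
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            q.put(chunk)
    q.put(None)  # sentinel: end of stream

def process_stream(path, handle_chunk):
    """Consumer: process chunks while the reader thread keeps the pipe full."""
    q = queue.Queue(maxsize=8)  # bounded, so the reader can't run far ahead
    t = threading.Thread(target=read_compressed, args=(path, q))
    t.start()
    while (chunk := q.get()) is not None:
        handle_chunk(chunk)
    t.join()
```

The bounded queue is the interesting design choice: it caps memory while still letting the decompressor stay a few chunks ahead of the processing thread.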

Does anyone have any experience to support or reject this conjecture?


Solution

  • Take care not to confuse disk seek times with disk read rates. It takes millions of CPU cycles (5–10 milliseconds, i.e. 5–10 million nanoseconds) to seek to the right track on a hard drive (HDD). Once you're there, you can read tens of megabytes of data per second, assuming low fragmentation. For solid-state drives (SSDs), seek times are far lower (35,000–100,000 ns) than for HDDs.

    Whether or not the data is compressed on the disk, you still have to seek. The question becomes: is (read time for compressed data + decompression time) < (read time for uncompressed data)? Decompression is relatively fast, since it amounts to replacing a short token with a longer one. In the end, it probably boils down to how well the data compressed and how big it was in the first place. If you're reading a 2 KB compressed file instead of a 5 KB original, it's probably not worth it. If you're reading a 2 MB compressed file instead of a 25 MB original, it likely is.

    Measure with a reasonable workload.
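    A minimal sketch of such a measurement in Python, comparing a raw read against a compressed read (file names are placeholders; note the page-cache caveat in the comment):

    ```python
    import gzip
    import time

    def read_raw(path):
        with open(path, "rb") as f:
            return f.read()

    def read_gzip(path):
        with gzip.open(path, "rb") as f:  # decompression interleaved with I/O
            return f.read()

    def best_time(fn, runs=3):
        """Best of N runs. min() damps scheduler noise, but the OS page
        cache makes warm runs optimistic -- use files larger than RAM
        (or drop caches between runs) for a realistic comparison."""
        best = float("inf")
        for _ in range(runs):
            start = time.perf_counter()
            fn()
            best = min(best, time.perf_counter() - start)
        return best

    # Hypothetical usage:
    # raw_t = best_time(lambda: read_raw("big.dat"))
    # gz_t  = best_time(lambda: read_gzip("big.dat.gz"))
    ```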