Search code examples
compressionlz4

Can lz4 find continuous repeated substrings?


I looked into the link to understand how lz4 works. I wrote the following test. We can see sinInput2 cannot be compressed because it has only random data. sinInput1's size is reduced to 1/8. I guess this is because its data has 8 different blocks. Does it mean lz4 can find repeated substrings? Does it have any limit of how long a substring can be found?

#include <string>
#include <iostream>
#include "lz4.h"
#include <stdio.h>      /* printf, NULL */
#include <stdlib.h>     /* srand, rand */
#include <time.h>       /* time */
#include <math.h>
int main() {
    using namespace std;

    srand(time(NULL));
    double sinInput1[1024];
    double sinInput2[1024];
    for (int i = 0; i < 1024; ++i) {
        sinInput1[i]=sin(i % 128); 
        sinInput2[i]=sin(i);
   }

    int inputSize = 1024 * sizeof(double);
    cout << "Input size: " << inputSize << endl;
    char *compressedData = new char[2 * inputSize];

    cout << "Compressed size of sinInput1: " << LZ4_compress_fast((char*)sinInput1, compressedData, inputSize, inputSize*2, 1) << endl;
    cout << "Compressed size of sinInput2: " << LZ4_compress_fast((char*)sinInput2, compressedData, inputSize, inputSize*2, 1) << endl;

    return 0;
}

Input size: 8192
Compressed size of sinInput1: 1064
Compressed size of sinInput2: 8222

Solution

  • LZ4 can find repeated substring. It's even very easy for it to find them.

    The repeated substring can have any length, up to 64 KB. In practice, most repeated patterns are <= 4 bytes, and the most common of them is a bunch of zeros. But there are occasional outliers.

    Once a pattern is found, it can be repeated any number of times. LZ4 has no format limit on the size of matches.