Search code examples
c++compressionlz77

Wrong encoding in LZ77


This function is somehow avoiding the ascii special characters such as space, new line sign or '\r', etc. It gives me wrong output basing on that implementation, implicating, that it is stricte problem with this signs.

Here's code for compression:

std::vector<Token> compression_lz77(const std::string &input, int searchBuffer, int lookAheadBuffer)
{
    int inputLength = input.length();
    int position = 0;

    std::vector<Token> data;

    while (position < inputLength)
    {
        Token token{};
        token.offset = 0;
        token.length_of_match = 0;
        token.code_word = input[position];

        int max_offset = (position < searchBuffer) ? position : searchBuffer;
        int max_search_length = (position + lookAheadBuffer) > inputLength ? inputLength - position : lookAheadBuffer;

        for (int offset = 1; offset <= max_offset; offset++)
        {
            int len = 0;
            while (len < max_search_length && input[position - offset + len] == input[position + len])
            {
                len++;
            }

            if (len > token.length_of_match)
            {
                token.offset = offset;
                token.length_of_match = len;
                token.code_word = input[position + len];
            }
        }

        data.push_back(token);
        position += token.length_of_match + 1;
    }

    return data;
}

And for my input I use:

The quick brown fox jumps over the lazy dog.
ABCABCABCDABCDEFABCDEFGABCDEFGHABCDEFGHI

Compressed output:

(0,0,T)(0,0,h)(0,0,e)(0,0, )(0,0,q)(0,0,u)(0,0,i)(0,0,c)(0,0,k)(6,1,b)(0,0,r)(0,0,o)(0,0,w)(0,0,n)(6,1,f)(5,1,x)(4,1,j)(16,1,m)(0,0,p)(0,0,s)(6,1,o)(0,0,v)(26,1,r)(5,1,t)(31,3,l)(0,0,a)(0,0,z)(0,0,y)(5,1,d)(15,1,g)(0,0,.)(0,0,
)(0,0,
)(0,0,A)(0,0,B)(0,0,C)(3,6,D)(4,4,E)(0,0,F)(6,6,G)(7,7,H)(8,8,I)

As you can see, there's bunch of empty trios (I mean <0,0,null>, and weird transfer of the right bracket to the new line, I guess implicating a new line character, but weirdly coded.

Using decompression alghorithm, which I pressume is implemented correctly, I recreate the string to the output file:

The)quick)brown)fox)jumps)over)the)lazy)dog.))ABCABCABCDABCDEFABCDEFGABCDEFGHABCDEFGHI

Can someone lend me a helpful hand how to manage those special chars?

Token structure:

struct Token
{
    int offset;
    int length_of_match;
    char code_word;
};

Decompression function:

std::string decompression_lz77(const std::vector<Token> &compressedData)
{
    std::string decompressed;

    for (const Token &token : compressedData)
    {
        if (token.offset == 0)
        {
                decompressed += token.code_word;
        }
        else
        {
            int startPos = decompressed.length() - token.offset;
            int endPos = startPos + token.length_of_match;

            for (int i = startPos; i < endPos; ++i)
            {
                decompressed += decompressed[i];
            }

            decompressed += token.code_word;
        }
    }

    return decompressed;
}

And also solve function, which can be treated as main, since in main function I'm only parsing the function and parameters from command line:

int solve(const CompressionParams &params)
{
    if (params.inputFileName.empty() || params.outputFileName.empty() || params.mode.empty() || params.inputBufferSize <= 0 || params.historyBufferSize <= 0)
    {
        std::cout << "Niepoprawne parametry linii polecen. Prosze podac wszystkie wymagane opcje." << std::endl;
        printInstructions();
        return 1;
    }

    std::ifstream inputFile(params.inputFileName, std::ios::binary);

    if (!inputFile.is_open())
    {
        std::cerr << "Blad podczas otwierania pliku wejsciowego: " << params.inputFileName << std::endl;
        return 1;
    }

    inputFile.seekg(0, std::ios::end);
    std::streampos fileSize = inputFile.tellg();
    inputFile.seekg(0, std::ios::beg);

    std::vector<char> fileContent(fileSize);
    inputFile.read(fileContent.data(), fileSize);
    inputFile.close();

    std::string data(fileContent.begin(), fileContent.end());
    std::vector<Token> arr;

    if (params.mode == "c")
    {
        if (!fileContent.empty())
        {
            arr = compression_lz77(data, params.inputBufferSize, params.historyBufferSize);

            std::ofstream outputFile(params.outputFileName);
            if (!outputFile.is_open())
            {
                std::cerr << "Wystapil blad podczas otwierania pliku wyjsciowego." << std::endl;
                return 1;
            }

            for (const auto &token : arr)
            {
                outputFile << "<" << token.offset << "," << token.length_of_match << "," << token.code_word << ">";
            }
            outputFile.close();

            auto startingSize = static_cast<double>(fileContent.size());
            double compressedSize = static_cast<double>(arr.size()) * (sizeof(Token) / sizeof(char));
            double wspolczynnik = (startingSize / compressedSize) * 100.0;

            std::cout << "Teoretyczny wspolczynnik kompresji: " << wspolczynnik << "%" << std::endl;
            std::cout << "Teoretyczny stopien kompresji " << (compressedSize / startingSize) << std::endl;
        }
    }
    else if (params.mode == "d")
    {
        std::ofstream outputFile(params.outputFileName, std::ios::binary);
        if (!outputFile.is_open())
        {
            std::cerr << "Wystapil blad podczas otwierania pliku wejsciowego." << std::endl;
            return 1;
        }

        std::string input(fileContent.begin(), fileContent.end());
        std::vector<Token> compressed_data;
        size_t pos = 0;

        while (pos < input.size())
        {
            size_t nextPos = input.find('>', pos);
            if (nextPos == std::string::npos)
            {
                break;
            }

            std::string tokenStr = input.substr(pos, nextPos - pos + 1);

            Token t{};
            std::istringstream tokenStream(tokenStr);
            char dummy;
            tokenStream >> dummy >> t.offset >> dummy >> t.length_of_match >> dummy >> t.code_word;
            compressed_data.push_back(t);
            pos = nextPos + 1;
        }

        for (const Token &token : compressed_data)
        {
            std::cout << '<' << token.offset << ',' << token.length_of_match << ',' << token.code_word << '>';
        }
        std::cout << std::endl;

        std::string decompressed_output = decompression_lz77(compressed_data);
        outputFile.write(decompressed_output.c_str(), decompressed_output.size());
        outputFile.close();
    }
    else
    {
        std::cerr << "Niepoprawny tryb. Proszę uzyc 'c' dla kompresji lub 'd' dla dekompresji." << std::endl;
        printInstructions();
        return 1;
    }
    return 0;
}

Everything needed is included like #string, #vector, etc.

Parsing it as values: searchBuffer: 256, lookAheadBuffer: 4096


Solution

  • The problem is in this line:

    tokenStream >> dummy >> t.offset >> dummy >> t.length_of_match >> dummy >> t.code_word;
    

    See cppreference on operator>>(std::basic_istream): it is a formatted input function, which skips whitespace characters. So, it will not fetch those whitespace characters into t.code_word. Instead, it fetches the next non-whitespace character ('>') that follows. In other words, >> will not extract a space or a newline character (i.e., all those that are considered whitespace) into a char.

    Maybe, since you are already writing offset and length_of_match values into the "compressed" output file as textual representation of numbers, not as raw byte values, you want to also write code_word values as numbers, not as characters?

    For example, replace this line

    outputFile << "<" << token.offset << "," << token.length_of_match << "," << token.code_word << ">";
    

    with this:

    outputFile << "<" << token.offset << "," << token.length_of_match << "," << static_cast<int>(token.code_word) << ">";
    

    And then replace this line

    tokenStream >> dummy >> t.offset >> dummy >> t.length_of_match >> dummy >> t.code_word;
    

    with this:

    int temp = 0;
    tokenStream >> dummy >> t.offset >> dummy >> t.length_of_match >> dummy >> temp;
    t.code_word = static_cast<char>(temp);
    

    And also, replace this line

    std::cout << '<' << token.offset << ',' << token.length_of_match << ',' << token.code_word << '>';
    

    with this:

    std::cout << '<' << token.offset << ',' << token.length_of_match << ',' << static_cast<int>(token.code_word) << '>';
    

    The "compressed" file will then contain characters' codes instead of raw characters, of course (i.e., <0,0,84><0,0,104><0,0,101>… instead of <0,0,T><0,0,h><0,0,e>…), but the decompression should succeed.


    An update (to address the question asked in the comment): generally speaking, compression is all about reducing the redundancy found in the input data. There is too little redundancy found in the famous pangram about quick brown fox, so it's not well compressible. For example, if you zip a text file containing only the The quick brown fox jumps over the lazy dog. phrase, and look into the resulting zip file (using either a hex editor or even a regular text editor), you'll see this phrase in its uncompressed form inside the file (among other metadata). The ABCABC…, on the other hand, has much redundancy, so is well compressible.

    output weighs more than input

    That's why I've put "compression" into quotation marks. To be really useful as a compression, the output should be saved not in a textual representation, i.e., not using <, ,, >, and instead of writing numbers like 126, it should write raw bytes (i.e., in case of 126, it'll look like ~).

    Also, all this has to be done in some more efficient way (because even raw bytes would waste precious space if stored as-is every time). For example, the Deflate algorithm (which is commonly used in zip- and gzip-files, as well as in PNG-images) combines LZ77 with Huffman coding.