Search code examples
rustcompressionhuffman-code

How to write already generated Huffman Codes to file in Rust


I am trying to implement Huffman encoding to compress files (for learning purpouses). I can generate Huffman codes for a given message and use that to build a binary string containing the compressed message. However I am stumped on how to save this "compressed" data. I want to have this data be as compact as possible (since its compression) but I am not sure if serde is the right approach. Even if it is what should I serialize as? (For example, something like Json would probably be bad for my use case since I do not care about human readability). Maybe I should just write the bytes to a file, but then I do not know how to include the lookup dictionary.

use bitvec::prelude::*;
use itertools::Itertools;
use std::{collections::HashMap, fmt, fs::File, io::{BufWriter, Read}, path::Path};
use serde::{Deserialize, Serialize, Serializer}

#[derive(Serialize, Deserialize, Debug)]
struct HuffmanData {
    dictionary: HashMap<String, char>,
    message: BitVec,
}

fn main() {
    let path = Path::new("data").join("romeo-juliet.txt") ;
    let file = File::open(path).expect("Could not open file");
    let huffman_data: HuffmanData = huffman(file);
    let save_path = Path::new("./compressed-data/").join("romeo-juliet.txt");
    let save_file = File::create(save_path).unwrap();
}

I have skipped the function definitions since they are not needed and are working as intended.


Solution

  • A bunch of questions there.

    A Huffman code for a symbol has a length measured in bits. When you write to a file, you write it in bytes, where each byte is eight bits. So how to connect those two? You need to consider the stream of bytes in the file as a stream of eight times that many bits. You will need to decide if the first bit of your bit stream is the most significant bit of the first byte or the least significant bit of the first byte. Let's pick the most significant bit.

    In your code you use an unsigned integer as a bit buffer. Let's say it's a 64-bit integer, a u64. You have a Huffman code you want to write that is k bits long. Use the bit operations, or and shift, to put those bits into your integer bit buffer, sliding up the bits already there. Also increment a counter that keeps track of how many bits are in the buffer, initialized to zero. So:

        buf = (buf << k) | code;
        bits += k;
    

    where code is an unsigned integer with the Huffman code in its low k bits, and zeros in the bits above that.

    If you keep doing that, you'll eventually lose bits off of the top of the buffer. Here is where we match up with the stream of bytes that is a file. Whenever the number of bits in the buffer is eight or more, we have enough to write a byte. So keep writing bytes, leaving no more than seven bits in the buffer unwritten:

        while bits >= 8 {
            bits -= 8;
            put(buf >> bits);
        }
    

    where put() writes one byte to your output. Do that every time you put bits into the bit buffer.

    A 64-bit buffer will permit up to 57-bit codes. That should cover any coding you're doing, since to generate a 58-bit code, you would need the frequencies to total at least 2,139,295,485,798. Huffman codes should be generated on chunks of data at a time, so that the codes can adapt to changes in the statistics of the data.

    When you're done, make sure to write out any leftover bits in the buffer into the top of the last byte, which may end up with some zero padding bits at the bottom. (More on how the decoder is to handle those padding bits below.)

        // flush
        if bits != 0 {
            put(buf << (8 - bits));
            bits = 0;
        }
    

    How to reverse this to read the bytes from a file and consume them as a stream of bits is then left as an exercise for the reader.

    Now for the other question. If you just write Huffman codes to the file, the recipient of the file will have no idea how to interpret the bits. They need the Huffman code description, preceding the Huffman-coded data. So how is that written to the output?

    There are many ways to encode the Huffman code into the output. Perhaps the easiest way that is still relatively compact is to traverse the tree that you constructed, recursively from the top, writing a 0 bit for each branch you encounter, or a 1 bit followed by the bits describing the symbol on that leaf (e.g. eight bits if you are Huffman coding a series of bytes). That is a self-terminating description, and so can immediately be followed by the Huffman-coded data. You would use the bits-to-bytes approach above to write that stream of bits to the output. There is no need to flush between the code description and the codes.

    A smaller representation can be had, with a little more attention to the Huffman code itself, by using a "canonical" Huffman code. For this code you save only the number of bits needed for the code for each symbol. Then you throw away the tree. The Huffman codes are then assigned to the symbols in order from the shortest codes to the longest codes, and then within each length, by a lexicographic ordering of the symbols. Then you only need to send those lengths to the recipient, not a description of the entire tree. You can find an example of such a scheme in the description of the deflate compressed data format.

    Lastly, you will need to be careful about your compressed data format and those 0..7 zero pad bits at the end in the last byte. How would the recipient know when the bit stream ends? Those last few zero bits could just as well be interpreted as one or more symbols, if your Huffman code has some short codes of a few bits each. To prevent this, you can either prepend your data with how many symbols should be decoded, or you can add an "end of stream" symbol to what you code, having a frequency of one. End your stream with the code for that symbol. Then the decoder will know to stop when it gets that code.