Search code examples
pythonpython-3.xcompressionhuffman-code

Python - Including frequency table at the beginning of a Huffman-compressed file


I am trying to implement Huffman compression and decompression of files, where all the information needed to decompress must be included in the compressed file. For this implementation I want to include the frequency table in the compressed file, such that the decompression program can rebuild the Huffman codes from this frequency table and then decompress the file. The frequency table looks something like this, where each index maps to an ASCII-character's decimal representation:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 847, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4183, 13, 0, 0, 0, 6, 0, 0, 26, 26, 0, 107, 84, 598, 124, 36, 72, 66, 42, 21, 8, 16, 9, 11, 10, 10, 46, 0, 0, 7, 0, 3, 0, 21, 30, 4, 20, 19, 30, 5, 34, 35, 0, 9, 19, 15, 7, 10, 9, 0, 8, 15, 19, 1, 9, 8, 2, 1, 8, 24, 29, 24, 23, 8, 0, 439, 189, 40, 252, 1514, 226, 241, 82, 462, 62, 353, 346, 306, 521, 436, 212, 0, 977, 512, 663, 100, 176, 24, 10, 53, 9, 23, 374, 23, 2, 0, 197, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 65, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 90, 0, 124, 0, 0, 75, 14, 0, 0, 49, 0, 33, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 66, 0, 0, 34, 0, 0, 0, 0, 0, 0, 157, 154, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 49, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 200, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

I.e., index 32 of the list is 4183, which tells me that SPACE (ASCII# 32) appears 4183 times in the compressed file.

I also have code in place to create the Huffman codes and convert each character into its Huffman code and append it to a long bitstring. The following code is functional, and it converts the bitstring into a byte-array and saves it as a binary-file:

byte_array = bytearray()
for i in range(0, len(bitstring), 8):
  byte = bitstring[i:i + 8]
  byte_array.append(int(byte, 2))

with open(output_file_path, "wb") as compressed_file:
  compressed_file.write(bytes(byte_array))

The resulting binary file is compressed from 17 KB to 10 KB successfully.

My problem is trying to include the frequency table at the beginning of this compressed file. I have tried several solutions but I run into problems and feel quite stuck.

Is there a simple way to include a frequency table such as above to the beginning of a compressed file in Python? Any tips for methods or functions that can be used to achieve this would be greatly appreciated.

I would want to achieve this with the frequency-table as-is and not using a Canonical Huffman code. And again, the compressed file alone and no other information must be enough to decompress the file without loss.

I have tried several function and methods that I find, but I am quite new to working with bytes and every method I have tried, such as converting the list to a bytearray, have failed. Because the list includes integers > 255 it won't convert to a byte-array like the bitstring does.

EDIT:

I am now sending the Huffman tree instead of the frequency table as suggested, but the tree is not rebuilt completely as it should be. Most of the leaf nodes are placed in the correct spot, but not all.

The following code creates the Huffman codes and at the same time creates the bitstring representing the Huffman tree:

def __create_huffman_codes(self, current_node, current_huffman_code):
  if not current_node:
    return
  self.huffman_tree_binary += "0"
  if current_node.char:
    self.huffman_tree_binary += "1"
    self.huffman_tree_binary += bin(current_node.char)[2:].rjust(8, "0")
    self.huffman_codes[current_node.char] = current_huffman_code
  self.__create_huffman_codes(current_node.left, current_huffman_code + "0")
  self.__create_huffman_codes(current_node.right, current_huffman_code + "1")

This method is called in the main method of the class as so:

huffman_tree_root = self.huffman_tree.pop()
current_huffman_code = ""
self.__create_huffman_codes(huffman_tree_root, current_huffman_code)
self.huffman_tree_binary += "00"

I add two trailing zeroes because the binary representation of the Huffman tree always ended up at 350.75 bytes.

The method to create the bytes for compression is updated:

def __create_bytes(self, bitstring):
  byte_array = bytearray()
  for i in range(0, len(self.huffman_tree_binary), 8):
    byte = self.huffman_tree_binary[i:i + 8]
    byte_array.append(int(byte, 2))
  for i in range(0, len(bitstring), 8):
    byte = bitstring[i:i + 8]
    byte_array.append(int(byte, 2))
  return byte_array

And then the bytes are written to a binary file.

On the other side, to rebuild the tree, I call the following method:

def huffman_decompress(self):
  [... open file ...]
    [... read bytes ...]
    if self.huffman_tree_binary.pop(0) == "0":
      self.huffman_tree_root = Node(None)
    self.huffman_tree_root.left = Node(None)
    self.huffman_tree_root.right = Node(None)
    self.__rebuild_huffman_tree(self.huffman_tree_root.left)
    self.__rebuild_huffman_tree(self.huffman_tree_root.right)
    [... decompression ...]

def __rebuild_huffman_tree(self, current_node):
    if len(self.huffman_tree_binary) == 0:
      return
    self.huffman_tree_binary.pop(0)
    if self.huffman_tree_binary[0] == "1":
      self.huffman_tree_binary.pop(0)
      bits = ""
      for _ in range(8):
        bits += self.huffman_tree_binary.pop(0)
      current_node.char = int(bits, 2)
    else:
      current_node.left = Node(None)
      current_node.right = Node(None)
      self.__rebuild_huffman_tree(current_node.left)
      self.__rebuild_huffman_tree(current_node.right)

This is surely not the most elegant implementation to recursively rebuild the tree, but I can't figure out why a fraction of the leaf nodes end up at different locations in the tree. I figure (naturally) there must be something wrong with how I build the binary representation pre-compression, or how I rebuild the tree, but I haven't figured out which one might be wrong yet.


Solution

  • No, you do not want to include the frequency table in your compressed data. You are trying to compress, so you want to use as few bits as possible to provide the information required to decompress. Sending the frequency table is the worst way to do that. The frequency table contains extraneous information that is not needed to reconstruct the Huffman codes. Many, many different frequency tables will produce the same Huffman code.

    You instead want to send a representation of the Huffman code that was computed from the frequency table. Two of the most common ways are to send the tree, or to send the code lengths.

    You can send the Huffman tree very easily by simply traversing the tree recursively, as you must have done to create the Huffman codes, and sending a 0 bit for each node encountered, and a 1 bit followed by eight bits for the symbol encoded for each leaf encountered. That's it. Nothing could be easier. Then you can reconstruct the tree directly on the other end with recursion, and use the tree for decoding. This tree representation is self-terminating, and so is immediately followed by the codes for your data.

    In your example, you are encoding 100 different symbols. Then the tree will have 99 nodes and 100 leaves, and so will take 99 + 900 = 999 bits. For comparison, your frequency table, if represented as two bytes per frequency, would take 4096 bits. Or if four bytes per frequency as shown in another answer here, then 8192 bits! I could get fancy with encoding up to frequency 127 with one byte and higher with two bytes and get it down to 2148 bits. Still more than double 999 bits.

    Though you exclude it, one could do better still by using a Canonical Huffman code, where you build the code only from the code lengths for each symbol, not from the tree. Then you can just send the code lengths and that same build process followed on the decoding end. You would then use Huffman coding on those lengths, preceding it with a very small representation of that Huffman code. This is what is done in Deflate compression. Deflate represents the code from your example in 608 bits.

    Update for new code in question:

    As I said above, "sending a 0 bit for each node encountered, and a 1 bit followed by eight bits for the symbol encoded for each leaf encountered". You are always sending a 0 with each call of __create_huffman_codes. You want to send a 0 only if it's a node, and you want to send a 1, followed by the symbol, only if it's a leaf. Also you don't need to call __create_huffman_codes if it's a leaf. You're done there. You call __create_huffman_codes (twice) only if it's a node.

    Also, it's just a waste of bits to add those two zeros to bring the tree description to a byte boundary for no reason, and it complicates the decoding. Just send the first symbol code bit immediately following the last Huffman tree bit.