c algorithm data-structures binary-tree huffman-code

Constructing sequential Huffman Tree From Scratch


Given a text file, I need to read each alphanumeric character and encode it using Huffman's algorithm.

Reading the characters, storing their probabilities, and creating the nodes are already solved, as is building the Huffman tree using pointers.

However, I need to create and initialize Huffman's tree using a sequential representation of a binary tree, without any pointers.

This could be done by creating a regular tree using pointers and then just reading it into the array, but I aim to directly populate an array with the nodes.

I considered creating smaller trees and merging them together, but opted for a matrix representation instead. I would repeatedly take the elements with the smallest probabilities from a binary heap and store them in the rows of a matrix, where each row represents the level at which the node should sit in the binary tree (in reverse order, that is).

For example, given characters and their probabilities as char[int] pairs:

a[1], b[1], c[2], d[1], e[3], f[11], g[2]

I aim to create a matrix that looks like this:
____________________________________
    a   |    b   |    d   |    g   |
____________________________________
   ab   |    c   |   dg   |    e   |
____________________________________
   abc  |   deg  |        |        |
____________________________________ 
 abcdeg |    f   |        |        |  
____________________________________
abcdefg |        |        |        |
____________________________________

Here the levels of a, b, c, d, e and f correspond to the rows of the matrix.

Currently, I'm stuck on two things. The first is how to recursively increment the levels of elements when their "parent" moves. If I'm combining two nodes from different levels (e.g. 'ab' and 'c'), I can simply set the level of 'c' equal to that of 'ab', but I don't know what to do when, for example, 'c' and 'd' were both in the second row. The second is how to create the full binary tree (if a node has a left child, it must also have a right one) knowing only the levels of the terminal nodes.

I understand that the question is not very specific, and I would appreciate hearing about any other approach to this problem rather than only a fix for the one described.


Solution

  • Is this a contrived problem for homework? I ask because representations of trees that don't use links require O(2^h) space to store a tree of height h. This is because they assume the tree is complete, allowing index calculations to replace pointers. Since Huffman trees can have height h=m-1 for an alphabet of size m, the size of the worst case array could be enormous. Most of it would be unused.
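
    To make the space argument concrete, here is a minimal sketch of the index arithmetic such a link-free layout relies on, assuming the root sits at slot 0 and height is counted in edges. The function names are illustrative and not taken from any library.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Implicit layout: the children of slot i live at 2i+1 and 2i+2 (root at 0). */
size_t left_child(size_t i)  { return 2 * i + 1; }
size_t right_child(size_t i) { return 2 * i + 2; }
size_t parent_of(size_t i)   { assert(i > 0); return (i - 1) / 2; }

/* Slots needed to hold any binary tree of height h: 2^(h+1) - 1. */
size_t implicit_slots(unsigned h) {
  assert(h + 1 < sizeof(size_t) * CHAR_BIT);  /* avoid shifting out of range */
  return ((size_t)1 << (h + 1)) - 1;
}

/* A Huffman tree over m symbols can degenerate to height m - 1. */
size_t worst_case_slots(unsigned m) {
  assert(m >= 1);
  return implicit_slots(m - 1);
}

/* An index-linked array needs exactly 2m - 1 nodes, whatever the shape. */
size_t linked_nodes(unsigned m) {
  assert(m >= 1);
  return 2 * (size_t)m - 1;
}
```

    For the 7-symbol example in the question, `worst_case_slots(7)` is 127 slots, against the 13 nodes an index-linked array needs.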

    But if you give up the idea that a link must be a pointer and allow it to be an array index, then you're fine. A long time ago, before dynamic memory allocators became common, this was standard. This problem is particularly well suited to the method because you always know the number of nodes in the tree in advance: one less than twice the alphabet size. In C you might do something like this:

    typedef struct {
      char ch;
      int f;
      int left, right; // Indices of children. If both -1, this is leaf for char ch.
    } NODE;
    
    #define ALPHABET_SIZE 7
    NODE nodes[2 * ALPHABET_SIZE - 1] = {
      { 'a', 1, -1, -1 },
      { 'b', 1, -1, -1 }, 
      { 'c', 2, -1, -1 }, 
      { 'd', 1, -1, -1 }, 
      { 'e', 3, -1, -1 },
      { 'f', 11, -1, -1 }, 
      { 'g', 2, -1, -1 },
      // Rest of array for internal nodes
    };
    int n_nodes = ALPHABET_SIZE;
    
    int add_internal_node(int f, int left, int right) {
      // Allocate a new node in the array and fill in its values.
      int i = n_nodes++;
      nodes[i] = (NODE) { .f = f, .left = left, .right = right };
      return i;
    }
    

    Now you'd use the standard tree-building algorithm like this:

    int build_huffman_tree(void) {
      // Add the indices of the leaf nodes to the priority queue.
      for (int i = 0; i < ALPHABET_SIZE; ++i)
        add_to_frequency_priority_queue(i);
      while (priority_queue_size() > 1) {
        int a = remove_min_frequency(); // Removes index of lowest freq node from the queue.
        int b = remove_min_frequency();
        int p = add_internal_node(nodes[a].f + nodes[b].f, a, b);
        add_to_frequency_priority_queue(p);
      }
      // Last node is huffman tree root.
      return remove_min_frequency();
    }
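
    The three queue helpers used above are not defined in this answer. One possible way to fill them in, sketched here as an assumption rather than as the answer's own code, is a fixed-size binary min-heap of node indices keyed on `nodes[i].f`. The names `heap`, `heap_size`, `heap_swap`, and `MAX_NODES` are made up for this sketch, and the `NODE` array is repeated so the block compiles on its own.

```c
#include <assert.h>

typedef struct {
  char ch;
  int f;
  int left, right;  /* Indices of children; both -1 for a leaf. */
} NODE;

#define ALPHABET_SIZE 7
#define MAX_NODES (2 * ALPHABET_SIZE - 1)

NODE nodes[MAX_NODES] = {
  { 'a', 1, -1, -1 }, { 'b', 1, -1, -1 }, { 'c', 2, -1, -1 },
  { 'd', 1, -1, -1 }, { 'e', 3, -1, -1 }, { 'f', 11, -1, -1 },
  { 'g', 2, -1, -1 },
};

/* Binary min-heap holding node indices, ordered by nodes[index].f. */
static int heap[MAX_NODES];
static int heap_size = 0;

static void heap_swap(int i, int j) {
  int t = heap[i]; heap[i] = heap[j]; heap[j] = t;
}

int priority_queue_size(void) { return heap_size; }

void add_to_frequency_priority_queue(int node_index) {
  assert(heap_size < MAX_NODES);
  int i = heap_size++;
  heap[i] = node_index;
  /* Sift up until the parent's frequency is no larger. */
  while (i > 0) {
    int parent = (i - 1) / 2;
    if (nodes[heap[parent]].f <= nodes[heap[i]].f) break;
    heap_swap(i, parent);
    i = parent;
  }
}

int remove_min_frequency(void) {
  assert(heap_size > 0);
  int min = heap[0];
  heap[0] = heap[--heap_size];
  /* Sift down, always swapping with the smaller child. */
  for (int i = 0;;) {
    int l = 2 * i + 1, r = l + 1, s = i;
    if (l < heap_size && nodes[heap[l]].f < nodes[heap[s]].f) s = l;
    if (r < heap_size && nodes[heap[r]].f < nodes[heap[s]].f) s = r;
    if (s == i) break;
    heap_swap(i, s);
    i = s;
  }
  return min;
}
```

    Incidentally, the heap is itself a sequential, pointer-free binary tree. It can rely on the 2i+1 / 2i+2 index arithmetic because a heap is always complete, which a Huffman tree generally is not.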
    

    The decoding algorithm will use the index of the root like this:

    char decode(BIT bits[], int huffman_tree_root_index) {
      int i = 0, p = huffman_tree_root_index;
      while (nodes[p].left != -1 || nodes[p].right != -1) // while not a leaf
        p = bits[i++] ? nodes[p].right : nodes[p].left;
      return nodes[p].ch;
    }
    

    Of course this doesn't return how many bits were consumed, which a real decoder needs to do. A real decoder is also not getting its bits in an array. Finally, for encoding you want parent indices in addition to the children. Working out these matters ought to be fun. Good luck with it.
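
    As a rough sketch of two of those loose ends, the block below adds a `parent` index to each node, uses it to produce a symbol's code by walking from leaf to root, and has the decoder report how many bits it consumed. The `PNODE` type, the `tree` array, and both function names are hypothetical. The tree is hand-built for three symbols a[1], b[1], c[2] so the block stands alone, and bits are stored one per byte for simplicity.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
  char ch;
  int f;
  int left, right;  /* Child indices; both -1 for a leaf. */
  int parent;       /* Parent index; -1 for the root (added for encoding). */
} PNODE;

/* Hand-built tree: node 3 joins a and b, node 4 (the root) joins c and node 3. */
PNODE tree[5] = {
  { 'a', 1, -1, -1,  3 },
  { 'b', 1, -1, -1,  3 },
  { 'c', 2, -1, -1,  4 },
  {  0,  2,  0,  1,  4 },
  {  0,  4,  2,  3, -1 },
};
#define ROOT 4

/* Writes the code of the given leaf into out[] (0 = left, 1 = right)
   and returns the code length in bits. */
size_t encode_symbol(int leaf, unsigned char out[], size_t cap) {
  size_t n = 0;
  /* Walk up to the root; bits come out in leaf-to-root order. */
  for (int i = leaf; tree[i].parent != -1; i = tree[i].parent) {
    assert(n < cap);
    out[n++] = (tree[tree[i].parent].right == i);
  }
  /* Reverse in place so the code reads root-to-leaf. */
  for (size_t a = 0, b = n; a + 1 < b; ++a, --b) {
    unsigned char t = out[a]; out[a] = out[b - 1]; out[b - 1] = t;
  }
  return n;
}

/* Decodes one symbol starting at bits[0] and stores the number of
   bits used in *consumed, so the caller can advance its position. */
char decode_symbol(const unsigned char bits[], size_t *consumed) {
  int p = ROOT;
  size_t i = 0;
  while (tree[p].left != -1 || tree[p].right != -1)  /* while not a leaf */
    p = bits[i++] ? tree[p].right : tree[p].left;
  *consumed = i;
  return tree[p].ch;
}
```

    With this tree, 'c' encodes as 0, 'a' as 10, and 'b' as 11. When the tree is built with `add_internal_node`, the parent fields could be filled in at the moment two children are joined.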