Search code examples
javacompressionpriority-queuehuffman-codelossless-compression

Is it possible to have two different Huffman encoding scheme yielding different number of bits?


I made a solution for Huffman encoding but my code is not yielding expected output.

I cannot figure out what is wrong.

Any help is appreciated.

abcdef
1 2 3 3 5 5

Your Output: 
00 01 10 110 1110 1111 

Expected Output: 
000 001 01 10 110 111 

It can be seen clearly that number of bits are different expected one being 16 mine being 17. Is it possible to have different output for different huffman schemes?

class Solution {

    class HuffmanNode {
        char c;
        int freq;
        HuffmanNode left;
        HuffmanNode right;

        public HuffmanNode(char c, int freq) {
            this.c = c;
            this.freq = freq;
            this.left = null;
            this.right = null;
        }
    }

    class HuffmanNodeComparator implements Comparator<HuffmanNode> {
        public int compare(HuffmanNode a, HuffmanNode b) {
            return a.freq - b.freq;
        }
    }

    ArrayList<String> res = new ArrayList<>();

    public ArrayList<String> huffmanCodes(String S, int f[], int N) {
        Queue<HuffmanNode> pq = new PriorityQueue<HuffmanNode>(new HuffmanNodeComparator());

        for (int i = 0; i < N; i++) {
            pq.offer(new HuffmanNode(S.charAt(i), f[i]));
        }

        while (pq.size() > 1) {
            HuffmanNode a = pq.poll();
            HuffmanNode b = pq.poll();
            HuffmanNode newNode = new HuffmanNode('-', a.freq + b.freq);

            newNode.left = a;
            newNode.right = b;

            pq.offer(newNode);
        }

        dfs(pq.poll(), "");

        return res;
    }

    public void dfs(HuffmanNode root, String HuffmanCode) {
        if (root == null) return;
        if (root.c != '-') {
            res.add(HuffmanCode);
            return;
        }

        dfs(root.left, HuffmanCode + "0");
        dfs(root.right, HuffmanCode + "1");
    }
}

Solution

  • Both are valid prefix codes, and both are valid Huffman codes for the set of frequencies 1, 2, 3, 3, 5, 5.

    However the ordering of your codes in both cases seems arbitrary. If they are supposed to correspond to the frequencies, then both outputs are wrong. Longer codes correspond to lower frequencies. A sequence of non-decreasing frequencies will correspond to a sequence of non-increasing code lengths.

    The way you compute the efficiency of a code is not by adding the lengths. You sum the products of the lengths and corresponding frequencies. If I sort and associate the lengths and frequencies appropriately, then the first code is 2x5 + 2x5 + 2x3 + 3x3 + 4x2 + 4x1 = 47. The second code is 2x5 + 2x5 + 3x3 + 3x3 + 3x2 + 3x1 = 47. Both codes are optimal, with both taking 47 bits to represent the message.

    You can get either code from Huffman's algorithm because there is an arbitrary choice at one point. After you pair 1 and 2 to get a subtree with frequency 3, you now have two other frequency 3 leaves. So now you can either combine the subtree and one of the leaves, or you can combine the two leaves. Those result in different trees at the end. However both will be optimal.