Search code examples
pythoncompressionhuffman-code

Why is my Huffman Compression making the Compressed output a larger size than my original text?


I am trying to write a program in Python to compress text using Huffman Compression. The problem I am having is that the compressed text ends up being larger than the original text when saved to a text file, and I am not sure why this is the case. I have implemented a class called Heapnode to help me build a priority queue and build my binary tree using Heapq. In the class HuffmanCoding I have implemented methods to get the frequency of each character, make a priority queue, use it to merge 'nodes' into a sort of binary tree, and to traverse that tree to build huffman codes for each character.



class HeapNode:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    def __lt__(self, other):  # if the frequency of one character is lower than the frequency of another one
        return self.freq < other.freq

    def __eq__(self, other):  # if two characters have the same frequencies
        if other == None:
            return False
        if not isinstance(other, HeapNode):  # checks if the character is a node or not
            return False
        return self.freq == other.freq


class HuffmanCoding:
    def __init__(self, text_to_compress):
        self.text_to_compress = text_to_compress  # text that will be compressed
        self.heap = []
        self.codes = {}  # will store the Huffman code of each character
        self.decompress_map = {}

    def get_frequency(self):  # method to find frequency of each character in text - RLE
        frequency_Dictionary = {}  # creates an empty dictionary where frequency of each character will be stored

        for character in self.text_to_compress:  # Iterates through the text to be compressed
            if character in frequency_Dictionary:
                frequency_Dictionary[character] = frequency_Dictionary[character] + 1  # if character already exists in
                # dictionary, its value is increased by 1
            else:
                frequency_Dictionary[character] = 1  # if character is not present in list, its value is set to 1

        return frequency_Dictionary

    def make_queue(self, frequency):  # creates the priority queue of each character and its associated frequency
        for key in frequency:
            node = HeapNode(key, frequency[key])  # create node (character) and store its frequency alongside it
            heapq.heappush(self.heap, node)  # Push the node into the heap

    def merge_nodes(
            self):  # creates HuffmanTree by getting the two minimum nodes and merging them together, until theres
        # only one node left
        while len(self.heap) > 1:
            node1 = heapq.heappop(self.heap)  # pop node from top of heap
            node2 = heapq.heappop(self.heap)  # pop next node which is now at the top of heap

            merged = HeapNode(None, node1.freq + node2.freq)  # merge the two nodes we popped out from heap
            merged.left = node1
            merged.right = node2

            heapq.heappush(self.heap, merged)  # push merged node back into the heap

    def make_codes(self, root, current_code):  # Creates Huffman code for each character
        if root == None:
            return

        if root.char != None:
            self.codes[root.char] = current_code
            self.decompress_map[current_code] = root.char

        self.make_codes(root.left, current_code + "0")  # Every time you traverse left, add a 0 - Recursive Call
        self.make_codes(root.right, current_code + "1")  # Every time you traverse right, add a 1 - Recursive Call

    def assignCodes(self):  # Assigns codes to each character
        root = heapq.heappop(self.heap)  # extracts root node from heap
        current_code = ""
        self.make_codes(root, current_code)

    def get_compressed_text(self, text):  # Replaces characters in original text with codes
        compressed_text = ""
        for character in text:
            compressed_text += self.codes[character]
        return compressed_text

    def show_compressed_text(self):

        frequency = self.get_frequency()
        self.make_queue(frequency)
        self.merge_nodes()
        self.assignCodes()

        compressed_text = self.get_compressed_text(self.text_to_compress)
        return compressed_text


print(HuffmanCoding('This sentence will get compressed').show_compressed_text())

Solution

  • Your codes are character strings containing the ASCII characters "0" and "1". Each character takes eight bits, so you are expanding your compressed data by a factor of eight.

    You need to instead make variable-length binary codes, where one bit takes one bit of space. You then need to be able to concatenate those variable-length codes to make a sequence of bytes (starting with b'', not ""), filling out the last byte with zero bits as needed. Then you have a sequence of bytes, each containing eight of the bits from your codes. Each can have any of the 256 possible byte values.

    You would use the bit operators on integers to construct this, specifically shift: <<, >>), or: |, and and: &. You can use bytes() to convert integers to bytes. E.g.:

    >>> bytes([65,66,67])
    b'ABC'
    

    Also note that you are compressing a very short string that won't get compressed even when you're writing the output as bits correctly. Especially once you send the code along with it. You'd need to test on much more text in order for the compression to take advantage of the frequencies of different letters in the English language.