I am trying to write a program in Python to compress text using Huffman Compression. The problem I am having is that the compressed text ends up being larger than the original text when saved to a text file, and I am not sure why this is the case. I have implemented a class called Heapnode to help me build a priority queue and build my binary tree using Heapq. In the class HuffmanCoding I have implemented methods to get the frequency of each character, make a priority queue, use it to merge 'nodes' into a sort of binary tree, and to traverse that tree to build huffman codes for each character.
class HeapNode:
def __init__(self, char, freq):
self.char = char
self.freq = freq
self.left = None
self.right = None
def __lt__(self, other): # if the frequency of one character is lower than the frequency of another one
return self.freq < other.freq
def __eq__(self, other): # if two characters have the same frequencies
if other == None:
return False
if not isinstance(other, HeapNode): # checks if the character is a node or not
return False
return self.freq == other.freq
class HuffmanCoding:
def __init__(self, text_to_compress):
self.text_to_compress = text_to_compress # text that will be compressed
self.heap = []
self.codes = {} # will store the Huffman code of each character
self.decompress_map = {}
def get_frequency(self): # method to find frequency of each character in text - RLE
frequency_Dictionary = {} # creates an empty dictionary where frequency of each character will be stored
for character in self.text_to_compress: # Iterates through the text to be compressed
if character in frequency_Dictionary:
frequency_Dictionary[character] = frequency_Dictionary[character] + 1 # if character already exists in
# dictionary, its value is increased by 1
else:
frequency_Dictionary[character] = 1 # if character is not present in list, its value is set to 1
return frequency_Dictionary
def make_queue(self, frequency): # creates the priority queue of each character and its associated frequency
for key in frequency:
node = HeapNode(key, frequency[key]) # create node (character) and store its frequency alongside it
heapq.heappush(self.heap, node) # Push the node into the heap
def merge_nodes(
self): # creates HuffmanTree by getting the two minimum nodes and merging them together, until theres
# only one node left
while len(self.heap) > 1:
node1 = heapq.heappop(self.heap) # pop node from top of heap
node2 = heapq.heappop(self.heap) # pop next node which is now at the top of heap
merged = HeapNode(None, node1.freq + node2.freq) # merge the two nodes we popped out from heap
merged.left = node1
merged.right = node2
heapq.heappush(self.heap, merged) # push merged node back into the heap
def make_codes(self, root, current_code): # Creates Huffman code for each character
if root == None:
return
if root.char != None:
self.codes[root.char] = current_code
self.decompress_map[current_code] = root.char
self.make_codes(root.left, current_code + "0") # Every time you traverse left, add a 0 - Recursive Call
self.make_codes(root.right, current_code + "1") # Every time you traverse right, add a 1 - Recursive Call
def assignCodes(self): # Assigns codes to each character
root = heapq.heappop(self.heap) # extracts root node from heap
current_code = ""
self.make_codes(root, current_code)
def get_compressed_text(self, text): # Replaces characters in original text with codes
compressed_text = ""
for character in text:
compressed_text += self.codes[character]
return compressed_text
def show_compressed_text(self):
frequency = self.get_frequency()
self.make_queue(frequency)
self.merge_nodes()
self.assignCodes()
compressed_text = self.get_compressed_text(self.text_to_compress)
return compressed_text
print(HuffmanCoding('This sentence will get compressed').show_compressed_text())
Your codes are character strings containing the ASCII characters "0" and "1". Each character takes eight bits, so you are expanding your compressed data by a factor of eight.
You need to instead make variable-length binary codes, where one bit takes one bit of space. You then need to be able to concatenate those variable-length codes to make a sequence of bytes (starting with b''
, not ""
), filling out the last byte with zero bits as needed. Then you have a sequence of bytes, each containing eight of the bits from your codes. Each can have any of the 256 possible byte values.
You would use the bit operators on integers to construct this, specifically shift: <<
, >>
), or: |
, and and: &
. You can use bytes()
to convert integers to bytes. E.g.:
>>> bytes([65,66,67])
b'ABC'
Also note that you are compressing a very short string that won't get compressed even when you're writing the output as bits correctly. Especially once you send the code along with it. You'd need to test on much more text in order for the compression to take advantage of the frequencies of different letters in the English language.