I have a string, and I want to find the most frequent variable-length substrings (up to a maximum length) in order to build better Huffman codes.
For example, the string
"++----++--++--+-++++++-+-+-----++-+---+-+---+-++--"
from which I manually created the following symbol-to-frequency dictionary:
In [1665]: symb2freq
Out[1665]: {'++--': 4, '-+-+-': 2, '--': 1, '+-++++++': 1, '----++-+--': 1, '--+-': 1}
This allowed me to build the Huffman binary string 110011111000011010110011 to represent:
'++----++--++--+-++++++-+-+-----++-+---+-+---+-++--'
which is about half the size, and better than using fixed-length symbol groupings. My question is: is there a regex, itertools, or other programming technique to create symb2freq dictionaries whose symbols have multiple lengths, up to a largest length? In my case the largest length is 11. The shortest length doesn't matter; the goal is to get the best frequencies possible, so that the Huffman code used to represent the string is as small as possible. If I group by a fixed size, I can't get as short a Huffman binary string as the best code set gives.
So my question is: how do I programmatically, using regex, itertools, or any other method, convert:
"++----++--++--+-++++++-+-+-----++-+---+-+---+-++--"
to:
In [1665]: symb2freq
Out[1665]: {'++--': 4, '-+-+-': 2, '--': 1, '+-++++++': 1, '----++-+--': 1, '--+-': 1}
which I built manually.
I think this is a hard problem, so thanks in advance for any advice on how to tackle this issue.
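To make the starting point concrete, the only step I can already automate is the counting: once the symbols have been chosen (which I currently do by hand), collections.Counter gives the frequencies. The partition list below is the manual one behind the symb2freq above; generating that partition is the part I don't know how to do programmatically.
from collections import Counter

# My hand-chosen partition of the pattern; choosing it is the manual step.
manual_partition = ['++--', '--', '++--', '++--', '+-++++++',
                    '-+-+-', '----++-+--', '-+-+-', '--+-', '++--']
assert ''.join(manual_partition) == '++----++--++--+-++++++-+-+-----++-+---+-+---+-++--'

symb2freq = dict(Counter(manual_partition))
print(symb2freq)
# {'++--': 4, '--': 1, '+-++++++': 1, '-+-+-': 2, '----++-+--': 1, '--+-': 1}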
I wrote the code below to show that variable-length codes can be used to decode a Huffman bit string back to the pattern shown, which I then use to recreate the number 1009732533765201 in half the size. I created the symb2freq manually and need ideas on how to create it programmatically, so I can achieve better compression than fixed grouping sizes give. I have a program that creates a mathematical pattern to recreate a number, and I'm looking to create variable-length codes that best represent those patterns, which are strings of +'s and -'s.
# Incorporated a better method of rebuilding the string from RoyM
# Uncompresses the number: 1009732533765201
# which has the binary:
# 0b11100101100101100010101100111111100100010001010001
# Our compressed binary is:
# 110011111000011010110011
# We compress a variable-length pattern that is then decoded, to try to achieve
# nearly as good compression (maybe better in this case) and, more importantly,
# to compare this approach against binary Huffman compression using XOR patterns.
# s maps Huffman codewords to pieces of the XOR math required to rebuild the
# original number: we don't compress the number itself, we compress the XOR
# math that rebuilds it instead.
s = {'11': '++--', '01': '-+-+-', '000': '+-++++++', '001': '--', '100': '--+-', '101': '----++-+--'}
compressed_binary = '110011111000011010110011'

def decodepattern(pattern, j=1):
    # Rebuild the integer from the +/- pattern by XOR-ing in successive
    # (positive or negative) powers of two.
    x = 2
    for c in range(len(pattern)):
        if pattern[c] == '+' or pattern[c] == '1':
            j ^= x
        else:
            j ^= -x
        x <<= 1
    return abs(j)

# Decode the Huffman bit string: grow the window compressed_binary[i:j] until it
# matches a codeword in s, emit that symbol, then continue from position j.
i = 0
j = 0
uncompressed_string = ''
while j < len(compressed_binary):
    if compressed_binary[i:j] in s:
        uncompressed_string += s[compressed_binary[i:j]]
        i = j
    else:
        j += 1
uncompressed_string += s[compressed_binary[i:j]]  # the final codeword is appended after the loop exits

answer = decodepattern(uncompressed_string)
print("Bin length of 1009732533765201 is 50 which is the same as the pattern length")
print("Original Bin of 1009732533765201 is: ")
print(bin(1009732533765201)[2:])
print("Original Pattern:")
print(uncompressed_string)
print("Compressed to: ", bin(13600435)[2:])
print("110011111000011010110011 uncompressed to the original string pattern of : ")
print(uncompressed_string)
print("Which then i was able to recreate the original integer of: ")
print(decodepattern(uncompressed_string))
Output of program is:
Bin length of 1009732533765201 is 50 which is the same as the pattern length
Original Bin of 1009732533765201 is:
11100101100101100010101100111111100100010001010001
Original Pattern:
++----++--++--+-++++++-+-+-----++-+---+-+---+-++--
Compressed to: 110011111000011010110011
110011111000011010110011 uncompressed to the original string pattern of :
++----++--++--+-++++++-+-+-----++-+---+-+---+-++--
Which then i was able to recreate the original integer of:
1009732533765201
So you can see that I can use variable-length codes to decompress a number; I just don't know how to create the variable-length codes (up to a certain length) without doing it manually.
The expected result is some sort of regex, itertools, or other method that takes the string:
"++----++--++--+-++++++-+-+-----++-+---+-+---+-++--"
and creates a symb2freq like this (which I did manually):
In [1665]: symb2freq
Out[1665]: {'++--': 4, '-+-+-': 2, '--': 1, '+-++++++': 1, '----++-+--': 1, '--+-': 1}
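(For reference, once a symb2freq exists, turning it into a code table like s above is just the standard Huffman construction. The heapq sketch below is one way to do it; the helper name build_codebook is mine, and tie-breaking may assign different bit patterns than my hand-built table.)
import heapq
from itertools import count

def build_codebook(symb2freq):
    # Standard Huffman construction: repeatedly merge the two least frequent
    # entries. The running counter only breaks frequency ties so the heap
    # never has to compare the dicts themselves.
    tiebreak = count()
    heap = [(freq, next(tiebreak), {sym: ''}) for sym, freq in symb2freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        freq_lo, _, lo = heapq.heappop(heap)
        freq_hi, _, hi = heapq.heappop(heap)
        merged = {sym: '0' + code for sym, code in lo.items()}
        merged.update({sym: '1' + code for sym, code in hi.items()})
        heapq.heappush(heap, (freq_lo + freq_hi, next(tiebreak), merged))
    return heap[0][2]

symb2freq = {'++--': 4, '-+-+-': 2, '--': 1, '+-++++++': 1,
             '----++-+--': 1, '--+-': 1}
print(build_codebook(symb2freq))
# Exact bit patterns depend on tie-breaking, but the total encoded length of
# the pattern is 24 bits either way, the same as with my hand-built table.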
Although I am not certain about the soundness of the question, here is how you can brute-force all possible codebooks:
#!/usr/bin/env python3
import huffman


def find_best_codebook(string, max_symbol_length, max_partitions):
    min_entropy = len(string)
    best_i = None
    best_weights = None
    best_codebook = None
    for i, tokens in enumerate(gen_partitions(string, max_partitions=max_partitions)):
        if any(len(t) > max_symbol_length for t in tokens):
            continue
        symbol_weights = compute_weights(tokens)
        codebook = huffman.codebook(symbol_weights.items())
        entropy = compute_entropy(symbol_weights, codebook)
        print(f'i={i} X={symbol_weights} H(X)={entropy}')
        if entropy < min_entropy:
            best_i = i
            best_weights = symbol_weights
            best_codebook = codebook
            min_entropy = entropy
    print(f'best i={best_i} X={best_weights} H(X)={min_entropy}')
    print('best codebook:', best_codebook)
    return best_codebook, min_entropy


def gen_partitions(string, max_partitions=None):
    # Recursively yield every way to split the string into up to
    # max_partitions + 1 (possibly empty) consecutive pieces.
    if max_partitions is None:
        max_partitions = len(string)
    if max_partitions <= 0:
        yield [string]
        return
    for index in range(len(string) + 1):
        for rest in gen_partitions(string[index:], max_partitions - 1):
            yield [string[:index]] + rest


def compute_weights(tokens):
    # Frequency count of the non-empty tokens of a partition.
    freqs = {}
    for token in tokens:
        if not token:
            continue
        freqs[token] = freqs.get(token, 0) + 1
    return freqs


def compute_entropy(symbol_weights, codebook):
    # Total encoded length in bits for this codebook (a cost to minimize,
    # not Shannon entropy despite the name).
    return sum(len(code) * symbol_weights[symbol] for symbol, code in codebook.items())


if __name__ == '__main__':
    string = '++----++--++--+-++++++-+-+-----++-+---+-+---+-++--'
    max_symbol_length = 11
    max_partitions = len(string) - max_symbol_length
    find_best_codebook(string, max_symbol_length, max_partitions)
Note that partitioning a string of length N into all possible partitions is exponential in the string length: there are N - 1 interior split points with two decisions at each (split or don't split), hence O(2^N) possible partitions.
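To see that growth concretely, here is a small self-contained check (the helper all_partitions is only for illustration, not part of the search above): enumerating partitions as subsets of the N - 1 interior cut positions gives exactly 2^(N-1) of them, so the count doubles with every extra character.
from itertools import combinations

def all_partitions(string):
    # Every way to cut the string at some subset of its n - 1 interior positions.
    n = len(string)
    for r in range(n):
        for cuts in combinations(range(1, n), r):
            bounds = (0,) + cuts + (n,)
            yield [string[a:b] for a, b in zip(bounds, bounds[1:])]

for n in range(1, 8):
    print(n, sum(1 for _ in all_partitions('+' * n)))
# prints 1, 2, 4, 8, 16, 32, 64 partitions for n = 1..7, i.e. 2**(n-1)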
Also, given the way the question is formed, you should be able to get the best compression by simply partitioning the input into consecutive chunks of the maximum codeword length. For example, using a maximum symbol length of 11 as stated in the question, you can split the string into {'++----', '++--++--+-+', '+++++-+-+--', '---++-+---+', '-+---+-++--'} to get 5 symbols (codewords of at most 3 bits) and a compressed text of at most 15 bits.
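A minimal sketch of that chunking fallback, reusing the same huffman package as above (the helper name chunked_codebook is mine; the 6-character remainder is kept as its own leading symbol so the chunks match the example partition):
from collections import Counter
import huffman

def chunked_codebook(string, max_symbol_length):
    # Fixed consecutive chunks of max_symbol_length, plus one shorter chunk
    # up front for the remainder.
    remainder = len(string) % max_symbol_length
    tokens = ([string[:remainder]] if remainder else []) + [
        string[i:i + max_symbol_length]
        for i in range(remainder, len(string), max_symbol_length)
    ]
    weights = Counter(tokens)
    codebook = huffman.codebook(weights.items())
    encoded = ''.join(codebook[token] for token in tokens)
    return codebook, encoded

string = '++----++--++--+-++++++-+-+-----++-+---+-+---+-++--'
codebook, encoded = chunked_codebook(string, 11)
print(codebook)                 # 5 symbols, codewords of at most 3 bits
print(encoded, len(encoded))    # 12 bits here, within the 15-bit bound above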