Tags: encoding, compression, entropy, information-theory

How many bits are required to encode information in probability set G = [0.001, 0.002, 0.003, 0.994]?


I am currently working on data compression and thought it would be a good time to read up on the basics of information theory to better understand data compression and its algorithms.

As I understand it, given a set of data, we can compute the minimum number of bits on average required to encode the data by multiplying the rounded-up entropy value by the length of the set. The following is the formula to compute the entropy of a set:

$$H(X) = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)$$

So, for a set G = [A, B, B, C, C, C, D, D, ..., D] of length 1000, the probability set would be [0.001, 0.002, 0.003, 0.994] and the entropy would be about 0.06. Rounding this up gives 1, which means 1 * 1000 = 1000 bits would be required to encode this set.
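A minimal Python sketch of that entropy calculation (using the probabilities from the example above; the exact value before rounding is about 0.0617 bits per symbol):

```python
import math

# Probabilities of A, B, C, D from the example set of length 1000
probs = [0.001, 0.002, 0.003, 0.994]

# H(X) = -sum p(x) * log2(p(x))
entropy = -sum(p * math.log2(p) for p in probs)

print(entropy)                    # ~0.0617 bits per symbol
print(math.ceil(1000 * entropy))  # ~62 bits for the whole set of 1000 symbols
```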

This would entail using only 1 bit per symbol to encode the whole set. I am unable to understand how I can use just 1 bit per symbol when there are 4 unique symbols in the set. Won't I require at least 2 bits per symbol, e.g. G = [00, 01, 01, 10, 10, 10, 11, 11, ..., 11]?

But this would lead to a usage of 2000 bits in total, contradicting the value computed using entropy. What am I missing here?


Solution

  • No, you can encode it in 62 bits on average. You round up after multiplying the entropy by the number of symbols, not before. This would be done using an arithmetic code, which can use less than one bit per symbol.

    If you tried to use Huffman coding, for which the number of bits assigned to any symbol must be an integer, the numbers of bits would be 1, 2, 3, and 3. Then the average number of bits to code one of those symbols, arrived at by multiplying the code lengths by the probabilities, would be 1.009 bits. So 1000 symbols would be 1009 bits, on average (see the sketch below).
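
    To make the comparison concrete, here is a minimal Python sketch (not part of the original answer) that derives the Huffman code lengths for these four probabilities and compares the average code length with the entropy bound:

    ```python
    import heapq
    import math

    probs = {'A': 0.001, 'B': 0.002, 'C': 0.003, 'D': 0.994}

    # Compute Huffman code lengths by repeatedly merging the two least-probable nodes.
    # Each heap entry is (probability, tiebreaker, list of symbols in that subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        # Every symbol in the two merged subtrees gains one more bit in its code.
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1

    print(lengths)  # {'A': 3, 'B': 3, 'C': 2, 'D': 1}

    avg_bits = sum(probs[s] * lengths[s] for s in probs)
    print(avg_bits)  # 1.009 bits per symbol -> ~1009 bits for 1000 symbols

    entropy = -sum(p * math.log2(p) for p in probs.values())
    print(math.ceil(1000 * entropy))  # ~62 bits: the bound an arithmetic coder can approach
    ```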