I am trying to understand the concept of Shannon's entropy and how to decide the codelength. In the first case, b is an array of 5 symbols. In general, b could contain any integer value between 1 and 8. For this array, the Shannon entropy comes out as NaN.
clear all
b = [1,3,2,6,1];
p_1 = sum(b==1)/length(b);
p_2 = sum(b==2)/length(b);
p_3 = sum(b==3)/length(b);
p_4 = sum(b==4)/length(b);
p_5 = sum(b==5)/length(b);
p_6 = sum(b==6)/length(b);
p_7 = sum(b==7)/length(b);
p_8 = sum(b==8)/length(b);
ShEntropy = -p_1 * log2(p_1) - (p_2) * log2(p_2) - p_3 * log2(p_3) -p_4 * log2(p_4) -p_5 * log2(p_5) -p_6 * log2(p_6)...
-p_7 * log2(p_7) -p_8 * log2(p_8)
%codelength
L = max([-log2(p_1), -log2(p_2), -log2(p_3), -log2(p_4), -log2(p_5), -log2(p_6), -log2(p_7), -log2(p_8)])
UPDATE:
Attached is a screenshot of a graph that is used to determine the word length L for a correlated sequence generated from a stationary, ergodic source (pubmedcentralcanada.ca/pmcc/articles/PMC4736934/bin/rsos150527supp1.pdf), where the calculation of the word length is shown. In the graph, since the maximum entropy is achieved at L = 8, the word length is 8.
**Question**: The formula in Eq. (2) is the Shannon entropy rate, which differs from the usual formula for i.i.d. sources. I cannot understand what N_{2^L} in the numerator would be. In the original question (before the update), the array b is of length N = 5, so the value of the entropy is a scalar. But I cannot understand how to implement Eq. (2), because the Shannon entropy in this paper is based on N and 2^L. For any sequence composed of k unique symbols (in my case k = 8), how do I implement Eq. (2)?
My understanding is that if length(b) = N, e.g. N = 20, then I calculate Eq. (2) as S_T for L = 1, S_T for L = 2, and so on up to S_T for L = N = 20. However, my confusion arises because the entropy is calculated based on the number of unique symbols, which in the binary case is k = 2.
What's going wrong is that the limit of p*log(p) as p -> 0 is 0, so you should only compute p*log(p) explicitly for p > 0. For p = 0, the expression evaluates numerically to 0*(-Inf), which is NaN, whereas the contribution to the entropy should be zero.
Something of this sort would help:
entropy = @(p) -sum( p(p>0) .* log2(p(p>0)) );
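For instance, applied to the array b from the question, a minimal sketch could look like this (assuming the symbols are the integers 1 through 8, as stated there):

b = [1,3,2,6,1];
k = 8;                                   % alphabet size from the question
p = zeros(1, k);
for i = 1:k
    p(i) = sum(b == i) / length(b);      % empirical probability of symbol i
end
entropy = @(p) -sum( p(p>0) .* log2(p(p>0)) );
H = entropy(p)                           % = 1.9219 bits; zero-probability symbols contribute nothing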
Hope that helps.
edit: In an attempt to add clarification in response to your comments: the above formula calculates the entropy of a source that emits N symbols, say s1, ..., sN, where the probability of seeing the n-th symbol sn is pn.
If you have a source that emits binary symbols, then you only have two symbols, say -1 and +1, with probabilities p and 1-p, and the entropy of this source is -p*log(p) - (1-p)*log(1-p). End of story.
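As a quick numerical illustration (the value p = 0.25 is just an assumed example):

p = 0.25;
H = -p*log2(p) - (1-p)*log2(1-p)         % ≈ 0.8113 bits per symbol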
However, this is only the entropy of the source if we treat each symbol separately. It may be that the source emits codewords consisting of a number of adjacent symbols, and the true structure of the source is only revealed once we look at trains of, say, L symbols making up codewords. As an example, in natural language, if you looked at a text only as an occurrence of individual letters, you would see little structure (e would be more frequent than, say, x, but that's pretty much it); the true nature of the structure in language becomes apparent only once you look at trains of symbols together, e.g., sc is likely followed by h, or even longer structures like words and sequences of words.
To mirror this, we can look at the entropy of codewords formed by L consecutive symbols. If your source is binary, there are N = 2^L possible words of length L (e.g., for L = 2 there are four codewords (00, 01, 10, 11), for L = 3 there are eight, and so on). Each word can be associated with a probability, and the entropy is calculated in the same way, HL = -sum(p(p>0).*log2(p(p>0))).
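For example, for L = 2 a minimal sketch could look like this (the codeword probabilities below are assumed purely for illustration):

p = [0.5, 0.25, 0.125, 0.125];           % probabilities of the codewords 00, 01, 10, 11
HL = -sum( p(p>0) .* log2(p(p>0)) )      % = 1.75 bits per codeword, i.e. 0.875 bits per symbol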
If you have no way of knowing the probabilities analytically, you can try to estimate them numerically by observing a long sample and counting how often each of the N = 2^L codewords is seen. The larger L, the harder this becomes, since the number of codewords grows very rapidly.
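A sketch of such a numerical estimate, assuming a binary sample x of -1/+1 symbols (here simply a random i.i.d. sequence for illustration) chopped into non-overlapping blocks of length L:

x = 2*(rand(1, 1e5) > 0.5) - 1;              % long observed binary sample (-1/+1)
L = 3;                                       % codeword length
nWords = floor(length(x) / L);
words = reshape(x(1:nWords*L), L, nWords)';  % one codeword per row
[~, ~, idx] = unique(words, 'rows');         % map each codeword to a distinct index
counts = accumarray(idx, 1);                 % how often each codeword was seen
p = counts / nWords;
HL = -sum( p(p>0) .* log2(p(p>0)) )          % ≈ 3 bits here, since the sample is fair and i.i.d.

Depending on the convention, overlapping windows of length L could be used instead of non-overlapping blocks; the counting and the entropy formula stay the same.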