matlab, decision-tree, entropy

Entropy of pure split calculated as NaN


I have written a function to calculate the entropy of a vector in which each element is the number of samples belonging to one class.

function x = Entropy(a)
    t = sum(a);
    t = repmat(t, [1, size(a, 2)]);
    x = sum(-a./t .* log2(a./t));
end

For example, with a = [4 0], entropy = -(4/4)*log2(4/4) - (0/4)*log2(0/4)

However, this function returns NaN when the split is pure, as in the example above: log2(0) is -Inf, and 0 * -Inf evaluates to NaN. The entropy of a pure split should be zero.
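
To make the failure concrete, here is a minimal script that reproduces it step by step. The variable names `p`, `terms`, and `H` are only illustrative and do not appear in the function above.

```matlab
a = [4 0];                  % class counts for a pure split
p = a ./ sum(a);            % class proportions: [1 0]
terms = -p .* log2(p);      % second term is 0 * -Inf, which is NaN
H = sum(terms);             % a single NaN term makes the whole sum NaN
disp(H)
```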

How can I fix this with the least impact on performance, given that the data is very large? Thanks.


Solution

  • I would suggest creating your own log2 function:

    function res = mylog2(a)
        res = log2(a);          % log2(0) returns -Inf
        res(isinf(res)) = 0;    % replace infinite results with 0
    end
    

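    This function changes the behaviour of log2, but it works in your specific case: each replaced value is multiplied by the argument of the log, which is zero at exactly those positions, so the term contributes zero to the sum. That matches the usual entropy convention that 0*log2(0) counts as 0. It is not "mathematically correct" as a logarithm, but I believe it is what you are looking for.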
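
As a rough sketch of how the suggestion above plays out, the script below inlines the same replacement that mylog2 performs and applies it to a pure split and to an even split. The names `counts`, `p`, `L`, `H_pure`, and `H_even` are made up for illustration.

```matlab
% Pure split from the question
counts = [4 0];
p = counts ./ sum(counts);    % class proportions: [1 0]
L = log2(p);                  % [0 -Inf]
L(isinf(L)) = 0;              % same replacement as in mylog2
H_pure = -sum(p .* L);        % zero instead of NaN

% Even split, to check that ordinary cases are unaffected
counts = [2 2];
p = counts ./ sum(counts);    % [0.5 0.5]
L = log2(p);                  % [-1 -1], nothing to replace
L(isinf(L)) = 0;
H_even = -sum(p .* L);        % one bit
```

Because the proportions always lie between 0 and 1, the only infinite value log2 can produce here is -Inf at a zero proportion, so the isinf test should not mask anything else. Dividing by the scalar sum(counts) also makes the repmat step in the original function unnecessary for a row vector of counts.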