python, statistics, entropy

Coverage-adjusted entropy estimation in Python


I need to estimate entropy for many variables in vocabulary data, and some of these have only small samples. I have previously done this in R with the Chao-Shen entropy estimator, but now I would like to be able to do it in Python.

Does anyone know of an implementation in Python for a "coverage-adjusted" entropy estimator, Chao-Shen or similar?

I've looked at scipy.stats.entropy, and it doesn't seem to offer any coverage-adjusted estimator (though I've used it plenty for empirical entropy calculations).


Solution

  • Here is a Python translation of the R source code (entropy.ChaoShen from the CRAN entropy package):

    import numpy as np
    
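    # Python translation of https://github.com/cran/entropy/blob/master/R/entropy.ChaoShen.R
    def CAE_entropy(counts):
        counts = np.asarray(counts)         # also accept plain lists/tuples
        counts = counts[counts > 0]         # drop bins with zero counts
        n = np.sum(counts)                  # total number of observations
        p = counts / n                      # empirical frequencies
    
        f1 = np.count_nonzero(counts == 1)  # number of singletons
        if f1 == n:                         # all singletons: avoid C = 0
            f1 = n - 1
    
        C = 1 - f1 / n                      # estimated sample coverage
        pa = C * p                          # coverage-adjusted frequencies
        la = 1 - (1 - pa)**n                # probability of seeing each bin in the sample
    
        return -np.sum(pa * np.log(pa) / la)  # entropy in nats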
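
To check the result against the plug-in estimate mentioned in the question, a self-contained sketch along these lines may help. The function name `chao_shen_entropy` and its `base` argument are hypothetical conveniences, not part of the R package or of SciPy, and the counts are made-up illustration data. `scipy.stats.entropy` normalizes raw counts to probabilities itself, so both estimators can take the same input.

```python
import numpy as np
from scipy.stats import entropy as plugin_entropy


def chao_shen_entropy(counts, base=None):
    """Chao-Shen coverage-adjusted entropy (same steps as the translation above).

    `base` is a hypothetical convenience argument: None gives nats,
    2 gives bits, 10 gives decimal digits.
    """
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]            # drop empty bins
    n = counts.sum()                       # sample size
    f1 = np.count_nonzero(counts == 1)     # number of singletons
    if f1 == n:                            # all singletons: avoid zero coverage
        f1 = n - 1
    coverage = 1 - f1 / n                  # estimated sample coverage
    pa = coverage * counts / n             # coverage-adjusted frequencies
    la = 1 - (1 - pa) ** n                 # chance each bin shows up in the sample
    h = -np.sum(pa * np.log(pa) / la)      # entropy in nats
    return h if base is None else h / np.log(base)


# Made-up counts for a small sample with several singletons.
counts = [4, 2, 1, 1, 1, 0]

h_plugin = plugin_entropy(counts)          # empirical (plug-in) entropy, nats
h_cs = chao_shen_entropy(counts)           # coverage-adjusted entropy, nats
h_cs_bits = chao_shen_entropy(counts, base=2)

print("plug-in:  ", h_plugin)
print("Chao-Shen:", h_cs)
```

For this particular sample the Chao-Shen value comes out larger than the plug-in value, since the singletons lower the estimated coverage and the estimator compensates for bins that were probably never observed.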