I am looking for a state of the art library for estimating differential entropy from finite samples. In an ideal world, it would have the following features:
What are my options?
First of all: this is arguably off-topic, as https://softwarerecs.stackexchange.com/ exists for this kind of question (but personally I don't mind).
Secondly, you can't prove a negative, but if your data is continuous and multidimensional, I would say that probably nothing ticks all of these boxes out of the box.
I have implemented the Kraskov estimator and a bunch of related measures in Python, as at the time there wasn't anything else publicly available apart from a couple of dubious scripts written in MATLAB on the MathWorks File Exchange (my project can be found here). Most of the heavy lifting is either pushed down to C (as I use scipy's cKDTree to find nearest neighbours) or to LAPACK/BLAS (i.e. Fortran), so I don't think there is much to be gained by further optimization. At least for my data sets, the Python "overhead" is small compared to everything else.
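To give an idea of what these k-nearest-neighbour estimators do, here is a minimal sketch of the Kozachenko-Leonenko estimator (the building block of the Kraskov approach). This is not the code from my repository; the function name `kl_entropy` and the default `k=3` are just placeholders.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(x, k=3):
    """Kozachenko-Leonenko differential entropy estimate in nats.

    x : array of shape (n_samples, n_dims), continuous data
    k : number of nearest neighbours (small k: low bias, high variance)
    """
    x = np.asarray(x, dtype=float)
    n, d = x.shape

    # distance from each point to its k-th nearest neighbour;
    # query k+1 neighbours because the point itself is returned at distance 0
    tree = cKDTree(x)
    distances, _ = tree.query(x, k=k + 1)
    r_k = distances[:, -1]

    # log-volume of the d-dimensional unit ball (Euclidean norm)
    log_unit_ball = (d / 2.0) * np.log(np.pi) - gammaln(d / 2.0 + 1.0)

    return digamma(n) - digamma(k) + log_unit_ball + d * np.mean(np.log(r_k))
```

As a sanity check, `kl_entropy(np.random.randn(10000, 1))` should come out close to the true value 0.5 * log(2 * pi * e) ≈ 1.42 nats; duplicate samples (zero nearest-neighbour distances) need to be handled before taking the log.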
I don't do any bias correction in the published version of the repository. This is by design: if the interactions between your variables are small enough that you need to worry about estimator bias, then you really need to worry about it carefully, rather than trust a default correction. All bias correction methods have a bunch of assumptions baked in, and providing anything out of the box does more harm than good, IMO.
Then there is NPEET, which is also in Python, is also built around the Kraskov estimator, and is very, very similar to my stuff (so similar, in fact, that when I first read the source, I thought they had forked my repo until I saw that they first published their code a month before me).
Finally, there is MINE, an algorithm developed in Yoshua Bengio's group. Their approach is conceptually very different from Kozachenko/Kraskov, and a very interesting read. They published their method last year, but there are already a couple of implementations on GitHub. I haven't had a chance to try it out myself, nor have I looked in detail at any of the implementations, so I don't have an informed opinion on it (other than that I am a big fan of Yoshua Bengio's work in general). The paper looks very promising, but I haven't seen an independent evaluation so far (that doesn't mean there isn't one, though). However, they are training a neural network with gradient descent on mini-batches to estimate the mutual information, so I don't expect it to be fast. At all.
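To give a flavour of why speed is a concern: MINE trains a small "statistics network" T(x, y) by stochastic gradient ascent on the Donsker-Varadhan lower bound on mutual information. Below is a rough sketch of that objective in PyTorch, based only on my reading of the paper and not on any of the GitHub implementations; the actual method additionally corrects a bias in the gradient of the log-term, which this toy version omits.

```python
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """Small MLP T_theta(x, y), scored on joint and on shuffled (marginal) pairs."""
    def __init__(self, dim_x, dim_y, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1)).squeeze(-1)

def dv_lower_bound(T, x, y):
    # Donsker-Varadhan bound: E_p(x,y)[T] - log E_p(x)p(y)[exp(T)]
    joint = T(x, y).mean()
    y_shuffled = y[torch.randperm(y.shape[0])]  # crude samples from the product of marginals
    marginal = torch.logsumexp(T(x, y_shuffled), dim=0) - math.log(x.shape[0])
    return joint - marginal

# toy example: correlated 1-d Gaussians
x = torch.randn(512, 1)
y = x + 0.5 * torch.randn(512, 1)
T = StatisticsNetwork(1, 1)
optimiser = torch.optim.Adam(T.parameters(), lr=1e-3)
for _ in range(2000):
    optimiser.zero_grad()
    loss = -dv_lower_bound(T, x, y)  # maximise the bound
    loss.backward()
    optimiser.step()
print("MI estimate (nats):", dv_lower_bound(T, x, y).item())
```

For comparison, the true mutual information in this toy example is 0.5 * log(1 + 1/0.25) ≈ 0.80 nats; note that even this two-dimensional problem takes thousands of gradient steps.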
For discrete/binned data, there is Ilya Nemenman's NSB estimator, which ticks all the boxes apart from your first one (presumably your crucial criterion).