Calculating the mutual information between two random vectors returns the same value


I want to calculate the mutual information between two numpy vectors:

>>> from sklearn.metrics.cluster import mutual_info_score
>>> import numpy as np

>>> a, b = np.random.rand(10), np.random.rand(10)
>>> mutual_info_score(a, b)
1.6094379124341005

>>> a, b = np.random.rand(10), np.random.rand(10)
>>> mutual_info_score(a, b)
1.6094379124341005

As you can see, although I updated a and b, it returned the same value. Then I tried another example:

>>> a = np.array([167.52523295,  73.2904335 ,  98.61953303, 152.17297007,
...        211.01341451, 327.72296346, 356.60500081,  43.9371432 ,
...        119.09474284, 125.20180842])

>>> b = np.array([280.9287028 , 131.76304983, 176.0277832 , 188.56630096,
...        229.09811401, 228.47200012, 617.67000122,  52.7211511 ,
...        125.95361582, 148.55247447])

>>> mutual_info_score(a, b)
2.302585092994046

>>> a = np.array([ 6.71381009,  1.43607653,  3.78729242, -4.75706796, -3.81281173,
...         3.23440092, 10.84495625, -0.19646145,  4.09724507, -0.13858104])

>>> b = np.array([ 4.25330873,  3.02197642, -3.2833848 ,  0.41855662, -3.74693531,
...         0.7674982 , 11.36459148,  0.64636462,  0.51817262,  1.65318943])

>>> mutual_info_score(a, b)
2.302585092994046

Why? Look at how different those numbers are. Why does it return the same value? More importantly, how do I calculate the MI between two vectors?


Solution

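  • Here, you're using a function that is meant for measuring the quality of clustering results: mutual_info_score treats every value as a discrete cluster label, so with continuous data each element becomes its own label and the magnitudes are ignored. With an estimator designed for continuous variables, you will obtain different numbers for different inputs.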
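A quick check of why the score does not change: when all values in both vectors are distinct, every sample gets its own label, and the score reduces to log(n) for n samples, whatever the values are. The sketch below assumes a seeded NumPy generator and silences the warning that newer scikit-learn versions may emit for continuous labels; the variable names are only illustrative.

```python
import warnings

import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)  # seeded only to make the run repeatable
n = 10
a, b = rng.random(n), rng.random(n)

with warnings.catch_warnings():
    # Newer scikit-learn versions may warn that clustering metrics
    # expect discrete labels; the call itself still runs.
    warnings.simplefilter("ignore")
    mi = mutual_info_score(a, b)

# All values are distinct, so each sample is its own "cluster"
# and the score equals log(n) regardless of the actual numbers.
print(mi, np.log(n))
```

For the 10-element arrays in the question this gives log(10) ≈ 2.3026, which matches the repeated value shown there.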
    To estimate the mutual information (MI) between a target vector and one or more feature vectors, you can use the mutual_info_regression function (as described here):

    In [1]: import numpy as np
    
    In [2]: from sklearn.feature_selection import mutual_info_regression
    
    In [3]: a, target = np.random.rand(10, 3) + 300, np.random.rand(10)
    
    In [4]: mi = mutual_info_regression(a, target)
    
    In [5]: mi
    Out[5]: array([0.18373016, 0.19396825, 0.09634921])
    

    In the above, I calculated the MI between each feature (column) of a and the target. E.g., the MI between the first feature and the target is ~0.184. There are various ways to estimate MI between variables, e.g.:

    • estimating MI with histograms, e.g.:

      import numpy as np
      from sklearn.metrics import mutual_info_score
      
      def MI(x, y, bins):
          # Joint counts of x and y on a bins-by-bins grid
          c_xy = np.histogram2d(x, y, bins)[0]
          # MI (in nats) computed directly from the contingency table
          mi = mutual_info_score(None, None, contingency=c_xy)
          return mi
      

      The challenge is finding a suitable value for the number of bins here. [1]

    • based on entropy estimation from k-nearest neighbors' distances (mutual_info_regression is based on this approach)

    • etc.
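For the case in the question (two one-dimensional vectors), the two estimators listed above can be compared side by side. This is a minimal sketch under some assumptions: the helper names hist_mi and knn_mi, the synthetic data, the sample size, and the bin counts are all illustrative choices, not part of scikit-learn. Note that mutual_info_regression expects a 2-D feature matrix, so a single vector has to be reshaped into one column.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.metrics import mutual_info_score


def hist_mi(x, y, bins):
    """Histogram-based MI estimate (in nats) between two 1-D vectors."""
    c_xy = np.histogram2d(x, y, bins)[0]
    return mutual_info_score(None, None, contingency=c_xy)


def knn_mi(x, y, n_neighbors=3, seed=0):
    """k-nearest-neighbor MI estimate between two 1-D vectors."""
    # reshape(-1, 1) turns the vector into a single-feature matrix
    return mutual_info_regression(
        x.reshape(-1, 1), y, n_neighbors=n_neighbors, random_state=seed
    )[0]


rng = np.random.default_rng(0)
x = rng.normal(size=500)
y_dep = x + 0.1 * rng.normal(size=500)  # strongly dependent on x
y_ind = rng.normal(size=500)            # independent of x

# The histogram estimate changes with the number of bins
hist_dep = {bins: hist_mi(x, y_dep, bins) for bins in (5, 20, 50)}
hist_ind = {bins: hist_mi(x, y_ind, bins) for bins in (5, 20, 50)}

knn_dep = knn_mi(x, y_dep)
knn_ind = knn_mi(x, y_ind)

print("histogram, dependent:  ", hist_dep)
print("histogram, independent:", hist_ind)
print("kNN, dependent:  ", knn_dep)
print("kNN, independent:", knn_ind)
```

With the histogram estimator, the value reported for the independent pair grows as the grid gets finer, which illustrates why the number of bins matters. The k-nearest-neighbor estimator avoids choosing bins but has its own parameter, n_neighbors.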

    P.S. Reading this document is worthwhile.