
Mutual Information: Calculation example (Java) in contingency table style


I am using the pointwise mutual information (PMI) association measure to quantify how strongly words co-occur, based on word frequencies obtained from a large corpus.

I am calculating PMI via the classical formula

log(P(X,Y) / (P(X) * P(Y)))

and also via the contingency table notation with joint and marginal frequencies that I found at http://collocations.de/AM/index.html

The results I get are very similar, but not the same. As far as I understand it, both methods should produce exactly the same value. I wrote a small Java program (minimal working example) that applies both formulae to word frequencies from a corpus, and the two methods give different results. Does anyone know why?

public class MutualInformation
{
    public static void main(String[] args)
    {
        long N = 1024908267229L;

        // mutual information = log(P(X,Y) / (P(X) * P(Y)))
        double XandY = (double) 1210738 / N;
        double X = (double) 67360790 / N;
        double Y = (double) 1871676 / N;

        System.out.println(Math.log(XandY / (X * Y)) / Math.log(10));
        System.out.println("------");

        // contingency table notation as on www.collocations.de
        long o11 = 1210738;
        long o12 = 67360790;
        long o21 = 1871676;
        long c1 = o11 + o21;
        long r1 = o11 + o12;
        double e11 = ((double) r1 * c1 / N);
        double frac = (double) o11 / e11;
        System.out.println(Math.log(frac) / Math.log(10));
    }

}

Solution

  • Let's write both calculations in the same terms.

       long o11 = 1210738;
       long o12 = 67360790;
       long o21 = 1871676;
       long N = 1024908267229L;
    

    The first equation is

       XandY = o11 / N;
       X = o12 / N;
       Y = o21 / N;
    

    so

      XandY / (X * Y)
    

    is

     (o11 / N) / (o12 / N * o21 / N)
    

    or

     o11 * N / (o12 * o21)
    

    Note that there is no addition anywhere in this expression.

    The second equation is rather different:

    c1 = o11 + o21;
    r1 = o11 + o12;
    e11 = ((double) r1 * c1 / N);
    frac = (double) o11 / e11;
    

    so

    e11 = (o11 + o21) * (o11 + o12) / N;
    frac = (o11 * N) / (o11^2 + o11 * o12 + o21 * o11 + o21 * o12);
    

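    The first denominator is o12 * o21, while the second is (o11 + o12) * (o11 + o21). I would expect the results to differ, because mathematically the two expressions are not the same.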

    I suggest you write what you want as maths first, and then find the most efficient way of coding it.
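
One possible way to reconcile the two calculations, offered as a sketch rather than a definitive fix: assume that 67360790 and 1871676 are the total (marginal) frequencies of X and Y, which is how the first formula uses them. Under that assumption the off-diagonal cells would be o12 = f(X) - o11 and o21 = f(Y) - o11, so that r1 and c1 recover the marginals. The class name PmiCheck and both method names are hypothetical, chosen only for illustration.

```java
class PmiCheck {
    // log10 PMI from probabilities: log10(P(X,Y) / (P(X) * P(Y))).
    static double pmiFromProbabilities(long fxy, long fx, long fy, long n) {
        double pXY = (double) fxy / n;
        double pX = (double) fx / n;
        double pY = (double) fy / n;
        return Math.log10(pXY / (pX * pY));
    }

    // log10 PMI in contingency-table style: log10(O11 / E11), E11 = R1 * C1 / N.
    // ASSUMPTION: fx and fy are marginal frequencies, so the off-diagonal
    // cells are obtained by subtracting the joint frequency.
    static double pmiFromContingency(long fxy, long fx, long fy, long n) {
        long o11 = fxy;
        long o12 = fx - fxy; // X without Y
        long o21 = fy - fxy; // Y without X
        long r1 = o11 + o12; // row total, equals fx
        long c1 = o11 + o21; // column total, equals fy
        double e11 = (double) r1 * c1 / n;
        return Math.log10(o11 / e11);
    }

    public static void main(String[] args) {
        long n = 1024908267229L;
        long fxy = 1210738L;
        long fx = 67360790L;
        long fy = 1871676L;

        System.out.println(pmiFromProbabilities(fxy, fx, fy, n));
        System.out.println(pmiFromContingency(fxy, fx, fy, n));
    }
}
```

With the cells defined this way, both methods reduce to fxy * N / (fx * fy) inside the logarithm, so the two printed values should agree up to floating-point rounding. If the two numbers in the question really are cell counts rather than marginals, then it is the first formula that would need the sums instead.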