Tags: java, linear-regression, anova, apache-commons-math

Is there an equivalent function for anova.lm() in Java?


I am comparing two linear models in R with Anova, and I would like to do the same thing in Java. To simplify it, I took the example code from https://stats.stackexchange.com/questions/48854/why-am-i-getting-different-intercept-values-in-r-and-java-for-simple-linear-regr and modified it a bit below. The models are test_trait ~ geno_A + geno_B and test_trait ~ geno_A + geno_B + geno_A:geno_B. The coefficients of the models implemented in R and Java are the same. In R I use anova(fit, fit2) where the fits are the results of lm and in Java I use TestUtils.oneWayAnovaPValue from org.apache.commons.math3.

With R I get a p-value of 0.797, while with Java I get a p-value of 0.817, so this is not the right method, but I cannot find how to do it correctly. Is there an equivalent of R's anova.lm in Java?

The full code is below.

R

test_trait <- c( -0.48812477 , 0.33458213, -0.52754476, -0.79863471, -0.68544309, -0.12970239,  0.02355622, -0.31890850,0.34725819 , 0.08108851)
geno_A <- c(1, 0, 1, 2, 0, 0, 1, 0, 1, 0)
geno_B <- c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0) 
fit <- lm(test_trait ~ geno_A+geno_B)
fit2 <- lm(test_trait ~ geno_A + geno_B + geno_A:geno_B)

which gives the coefficients

> fit
Call:
lm(formula = test_trait ~ geno_A + geno_B)

Coefficients:
(Intercept)       geno_A       geno_B  
   -0.03233     -0.10479     -0.60492  

> fit2
Call:
lm(formula = test_trait ~ geno_A + geno_B + geno_A:geno_B)

Coefficients:
  (Intercept)         geno_A         geno_B  geno_A:geno_B  
    -0.008235      -0.152979      -0.677208       0.096383  

And the Anova

> anova(fit, fit2) # 0.797 
Analysis of Variance Table

Model 1: test_trait ~ geno_A + geno_B
Model 2: test_trait ~ geno_A + geno_B + geno_A:geno_B
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1      7 0.77982                           
2      6 0.77053  1 0.0092897 0.0723  0.797

Java

double[] y = {-0.48812477,  0.33458213,
              -0.52754476, -0.79863471,
              -0.68544309, -0.12970239,
               0.02355622, -0.31890850,
               0.34725819,  0.08108851};
double[][] x = {{1,0}, {0,0},
                {1,0}, {2,1},
                {0,1}, {0,0},
                {1,0}, {0,0},
                {1,0}, {0,0}};
double[][] xb = {{1,0,0}, {0,0,0},
                 {1,0,0}, {2,1,2},
                 {0,1,0}, {0,0,0},
                 {1,0,0}, {0,0,0},
                 {1,0,0}, {0,0,0}};

OLSMultipleLinearRegression regr = new OLSMultipleLinearRegression();
regr.newSampleData(y, x);
double[] beta = regr.estimateRegressionParameters();   

System.out.printf("First model: y = int + genoA + genoB\n");
System.out.printf("Intercept: %.3f\t", beta[0]);
System.out.printf("beta1: %.3f\t", beta[1]);
System.out.printf("beta2: %.3f\n\n", beta[2]);

regr.newSampleData(y, xb);
double[] betab = regr.estimateRegressionParameters();   

System.out.printf("Second model: y = int + genoA + genoB + genoA:genoB\n");
System.out.printf("Intercept: %.3f\t", betab[0]);
System.out.printf("beta1: %.3f\t", betab[1]);
System.out.printf("beta2: %.3f\t", betab[2]);
System.out.printf("beta3: %.3f\n", betab[3]);

Which gives the same coefficients as in R

First model: y = int + genoA + genoB
Intercept: -0.032   beta1: -0.105   beta2: -0.605

Second model: y = int + genoA + genoB + genoA:genoB
Intercept: -0.008   beta1: -0.153   beta2: -0.677   beta3: 0.096

But the Anova gives a different result

List<double[]> classes = new ArrayList<>();
classes.add(beta);
classes.add(betab);
double pvalue = TestUtils.oneWayAnovaPValue(classes);
double fvalue = TestUtils.oneWayAnovaFValue(classes);
System.out.println(pvalue); 
System.out.println(fvalue); 

0.8165390406874127
0.05979444576790511

Solution

  • You're very much misunderstanding ANOVA in the case where you compare two regressions. That's not an ANOVA in the sense of oneWayAnova. The equivalent of oneWayAnova in R is the function aov. The function anova, on the other hand, implements a multitude of tests for comparing models, and the name anova is confusing to say the least...

    If you compare two nested regression models, you want an F test on the residual sums of squares. What your code does instead is a one-way ANOVA to see whether the two sets of regression parameters differ significantly. That's not what you want to do, but it's exactly what your Java code is doing.

    In order to calculate the correct F test, you need to do the following:

    1. Calculate the MSE for the largest model by dividing its Residual Sum of Squares (RSS) by its degrees of freedom (df) (in the R table: 0.77053 / 6).

    2. Calculate the MSE of the difference by subtracting the RSS of the two models (the result is "Sum of Sq." in the R table), subtracting their df (the result is "Df" in the R table), and dividing the first result by the second.

    3. Divide the result of step 2 by the result of step 1; that is your F value.

    4. Calculate the p-value using the F value from step 3, with the df difference as the numerator df and the df of the largest model as the denominator df.
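    The four steps above can be sanity-checked with plain arithmetic, using only the RSS and df values from the R anova() table (a minimal sketch; the class and method names here are made up for illustration):

    ```java
    // Steps 1-3 by hand, from the values in the R anova() table.
    // RSS and df come straight from that table; only F is computed here.
    public class FTestByHand {

        // F = ((RSS_small - RSS_big) / (df_small - df_big)) / (RSS_big / df_big)
        static double fValue(double rssSmall, double dfSmall,
                             double rssBig, double dfBig) {
            double mse = rssBig / dfBig;                              // step 1
            double mseDiff = (rssSmall - rssBig) / (dfSmall - dfBig); // step 2
            return mseDiff / mse;                                     // step 3
        }

        public static void main(String[] args) {
            double f = fValue(0.77982, 7, 0.77053, 6);
            System.out.printf("F = %.4f%n", f); // matches the 0.0723 in the R table
            // Step 4 would feed this F into an F distribution with (1, 6) df,
            // which is what gives the p-value of 0.797.
        }
    }
    ```

    This reproduces the F column of the table exactly, which confirms that anova(fit, fit2) is nothing more than these four steps.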

    As far as I know, the class OLSMultipleLinearRegression doesn't have any convenient methods to extract the number of degrees of freedom, so this is not straightforward to do in Java. You'll have to calculate the df manually and then use the class FDistribution to calculate the p-value.

    eg:

    import org.apache.commons.math3.distribution.FDistribution;
    import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression;

    OLSMultipleLinearRegression regr = new OLSMultipleLinearRegression();
    regr.newSampleData(y, x);
    double RSS1 = regr.calculateResidualSumOfSquares();
    double df1 = y.length - (x[0].length + 1);
        // df = n - number of coefficients, including the intercept

    regr.newSampleData(y, xb);
    double RSS2 = regr.calculateResidualSumOfSquares();
    double df2 = y.length - (xb[0].length + 1);

    double MSE = RSS2 / df2; // EDIT: you need the biggest model here!
    double MSEdiff = Math.abs((RSS2 - RSS1) / (df2 - df1));
    double dfdiff = Math.abs(df2 - df1);

    double Fval = MSEdiff / MSE;

    // numerator df = df difference, denominator df = df of the largest model
    FDistribution Fdist = new FDistribution(dfdiff, df2);
    double pval = 1 - Fdist.cumulativeProbability(Fval);
    

    Now both the F value and the p-value should be exactly what you see in R's anova() table. df1 and df2 are the Res.Df column in the R table, their difference is the Df column, and MSEdiff should equal Sum of Sq. divided by Df from the R table.

    Disclaimer: I'm a bad Java programmer, so the code above is more conceptual than actual code. Please check it for typos or silly mistakes, and consult the documentation of the FDistribution class I used here:

    https://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/distribution/FDistribution.html#cumulativeProbability%28double%29

    And now you know why statisticians use R instead of Java ;-)


    EDIT : The FDistribution used in the code above is the class

    org.apache.commons.math3.distribution.FDistribution
    

    There's also an FDistribution in JSci :

    JSci.maths.statistics.FDistribution
    

    If you use that one, the last part of the code becomes:

    FDistribution Fdist = new FDistribution(dfdiff, df2);
    double pval = 1 - Fdist.cumulative(Fval);
    

    Depending on the exact implementation, the cumulative probabilities might differ slightly. Alas, I have no idea how large the difference is, or which implementation is more trustworthy.