Tags: java, optimization, jvm, readability

Is leaving redundant variables etc. for readability in Java performance-impacting?


I'm working on an optimized Java library, and I'm wondering if having things like

    int rX = rhsOffset;
    int rY = rhsOffset + 1;
    int rZ = rhsOffset + 2;
    int rW = rhsOffset + 3;

where the local variable rX is redundant yet makes code further down the line more readable. Does rX in this case just get compiled out, either in the Java bytecode or at JIT execution time?

Also, I've seen libraries do things like

 m[offset + 0] = f / aspect;
 m[offset + 1] = 0.0f;
 m[offset + 2] = 0.0f;
 m[offset + 3] = 0.0f;

where the " + 0 " is done purely to improve the look of the code.

I want to do the same, but would like to make sure I'm not hurting performance. I don't know of any good way to determine whether memory is allocated or extra math is performed in either of these cases. In Android Studio you can use the memory profiler, which allows you to capture all allocations and examine them, but IntelliJ doesn't appear to offer that functionality, and I'm assuming I can't rely on any optimizations Android's build toolchain performs being applied to a normal (non-Android) Java project.


Solution

  • I wrote some code to investigate this experimentally; see my repository on GitHub.

    Summary: I ran some experiments on my 64-bit Ubuntu machine with Oracle JDK 9. From what I can tell, in these particular experiments, (i) redundant variables don't seem to affect runtime in practice, and (ii) whether you add a redundant 0 or not doesn't seem to matter. My advice is not to worry about the kind of performance concerns you mention: the just-in-time compiler is probably smart enough for these sorts of things, and lousy performance might never become a problem in the first place.

    For the first question, I ran my experiment with both the Oracle JDK 9 javac compiler and the embeddable Janino compiler. I got similar results with both, suggesting that most of the optimization is probably carried out by the JIT.

    I would advise you to run your own experiments on your JVM, with toy examples that you believe are representative of what you are doing. Or measure directly in your actual code, in case lousy performance does turn out to be a problem.
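
    A minimal warm-up-then-time harness along the following lines is enough to compare variants on your own JVM (the class and method names here are illustrative, not the exact code in my repo):

```java
// Minimal microbenchmark sketch: warm the JIT up on the method under
// test before taking any timings, and keep the result live so the JIT
// cannot dead-code-eliminate the workload.
public class TinyBench {
    // Workload with a "redundant" local, mirroring the style in question.
    static double workload(double[] x) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double v = x[i]; // redundant local kept for readability
            sum += v * v;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] x = new double[10_000];
        for (int i = 0; i < x.length; i++) x[i] = Math.random();

        // Warm-up: give the JIT a chance to compile the hot method.
        double sink = 0.0;
        for (int i = 0; i < 20_000; i++) sink += workload(x);

        // Timed runs after warm-up.
        for (int run = 0; run < 5; run++) {
            long t0 = System.nanoTime();
            sink += workload(x);
            long t1 = System.nanoTime();
            System.out.printf("\"Elapsed time: %f msecs\"%n", (t1 - t0) / 1e6);
        }
        // Use 'sink' so the whole computation stays observable.
        if (Double.isNaN(sink)) System.out.println(sink);
    }
}
```

    This is deliberately crude; for anything serious a dedicated harness that handles warm-up, forking, and dead-code elimination for you is the safer choice.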

    The details of my experiments follow below.

    Question 1: Does introducing redundant variables affect execution time?

    I introduced a parameter, let's call it n, that controls the degree of redundant assignment, and wrote a code generator that generates code for a nonsense computation, introducing redundant assignments according to the value of n. For instance, for n=0 it produces this code:

    public static double eval0(double[] X, double[] Y) {
      double sum = 0.0;
      assert(X.length == Y.length);
      int iters = X.length/3;
      for (int i = 0; i < iters; i++) {
        int at = 3*i;
        double x0 = X[at + 0];
        double x1 = X[at + 1];
        double x2 = X[at + 2];
        double y0 = Y[at + 0];
        double y1 = Y[at + 1];
        double y2 = Y[at + 2];
        double x1y2 = x1*y2;
        double x2y1 = x2*y1;
        double a = x1y2-x2y1;
        double x2y0 = x2*y0;
        double x0y2 = x0*y2;
        double b = x2y0-x0y2;
        double x0y1 = x0*y1;
        double x1y0 = x1*y0;
        double c = x0y1-x1y0;
        sum += a + b + c;
      }
      return sum;
    }

    and for, say, n=3 it produces this code:

    public static double eval3(double[] X, double[] Y) {
      double sum = 0.0;
      assert(X.length == Y.length);
      int iters = X.length/3;
      for (int i = 0; i < iters; i++) {
        int at = 3*i;
        double x0 = X[at + 0];
        double x1 = X[at + 1];
        double x2 = X[at + 2];
        double y0 = Y[at + 0];
        double y1 = Y[at + 1];
        double y2 = Y[at + 2];
        double x1y2_28 = x1*y2;
        double x1y2_29 = x1y2_28;
        double x1y2_30 = x1y2_29;
        double x1y2 = x1y2_30;
        double x2y1_31 = x2*y1;
        double x2y1_32 = x2y1_31;
        double x2y1_33 = x2y1_32;
        double x2y1 = x2y1_33;
        double a_34 = x1y2-x2y1;
        double a_35 = a_34;
        double a_36 = a_35;
        double a = a_36;
        double x2y0_37 = x2*y0;
        double x2y0_38 = x2y0_37;
        double x2y0_39 = x2y0_38;
        double x2y0 = x2y0_39;
        double x0y2_40 = x0*y2;
        double x0y2_41 = x0y2_40;
        double x0y2_42 = x0y2_41;
        double x0y2 = x0y2_42;
        double b_43 = x2y0-x0y2;
        double b_44 = b_43;
        double b_45 = b_44;
        double b = b_45;
        double x0y1_46 = x0*y1;
        double x0y1_47 = x0y1_46;
        double x0y1_48 = x0y1_47;
        double x0y1 = x0y1_48;
        double x1y0_49 = x1*y0;
        double x1y0_50 = x1y0_49;
        double x1y0_51 = x1y0_50;
        double x1y0 = x1y0_51;
        double c_52 = x0y1-x1y0;
        double c_53 = c_52;
        double c_54 = c_53;
        double c = c_54;
        sum += a + b + c;
      }
      return sum;
    }
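
    For what it's worth, redundant chains like the ones above can be emitted by a small helper along these lines (a hypothetical sketch; my actual generator numbers the temporaries with a global counter, hence the _28, _29, ... suffixes above):

```java
// Hypothetical sketch of the generator's core: emit n redundant copy
// assignments before the final assignment to 'name'.
public class RedundancyGen {
    static String redundantAssign(String name, String expr, int n) {
        StringBuilder sb = new StringBuilder();
        String prev = expr;
        for (int i = 0; i < n; i++) {
            String tmp = name + "_" + i;
            sb.append("double ").append(tmp).append(" = ").append(prev).append(";\n");
            prev = tmp; // each copy reads the previous temporary
        }
        sb.append("double ").append(name).append(" = ").append(prev).append(";\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // For n=0 this emits a single direct assignment; for n=3 it emits
        // three redundant copies followed by the final assignment.
        System.out.print(redundantAssign("x1y2", "x1*y2", 3));
    }
}
```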

    Both these functions perform exactly the same computation, but one has more redundant assignments. Finally, I also generate a dispatch function:

    public double eval(int n, double[] X, double[] Y) {
      switch (n) {
        case 0: return eval0(X, Y);
        case 1: return eval1(X, Y);
        case 2: return eval2(X, Y);
        case 3: return eval3(X, Y);
        case 4: return eval4(X, Y);
        case 5: return eval5(X, Y);
        case 8: return eval8(X, Y);
        case 11: return eval11(X, Y);
        case 15: return eval15(X, Y);
        case 21: return eval21(X, Y);
        case 29: return eval29(X, Y);
        case 40: return eval40(X, Y);
        case 57: return eval57(X, Y);
        case 79: return eval79(X, Y);
        case 111: return eval111(X, Y);
        case 156: return eval156(X, Y);
        case 218: return eval218(X, Y);
        case 305: return eval305(X, Y);
      }
      assert(false);
      return -1;
    }
    

    All the generated code is on my repo here.

    Then I benchmarked all these functions, for different values of n, on X and Y arrays of size 10000 filled with random data. I did this using both the Oracle JDK 9 javac compiler and the embeddable Janino compiler. My benchmarking code also lets the JIT warm up a bit. Running the benchmark produces this output:

    ------ USING JAVAC
    n = 0
    "Elapsed time: 0.067189 msecs"
       Result= -9.434172113697462
    n = 1
    "Elapsed time: 0.05514 msecs"
       Result= -9.434172113697462
    n = 2
    "Elapsed time: 0.04627 msecs"
       Result= -9.434172113697462
    n = 3
    "Elapsed time: 0.041316 msecs"
       Result= -9.434172113697462
    n = 4
    "Elapsed time: 0.038673 msecs"
       Result= -9.434172113697462
    n = 5
    "Elapsed time: 0.036372 msecs"
       Result= -9.434172113697462
    n = 8
    "Elapsed time: 0.203788 msecs"
       Result= -9.434172113697462
    n = 11
    "Elapsed time: 0.031491 msecs"
       Result= -9.434172113697462
    n = 15
    "Elapsed time: 0.032673 msecs"
       Result= -9.434172113697462
    n = 21
    "Elapsed time: 0.030722 msecs"
       Result= -9.434172113697462
    n = 29
    "Elapsed time: 0.039271 msecs"
       Result= -9.434172113697462
    n = 40
    "Elapsed time: 0.030785 msecs"
       Result= -9.434172113697462
    n = 57
    "Elapsed time: 0.032382 msecs"
       Result= -9.434172113697462
    n = 79
    "Elapsed time: 0.033021 msecs"
       Result= -9.434172113697462
    n = 111
    "Elapsed time: 0.029978 msecs"
       Result= -9.434172113697462
    n = 156
    "Elapsed time: 18.003687 msecs"
       Result= -9.434172113697462
    n = 218
    "Elapsed time: 24.163828 msecs"
       Result= -9.434172113697462
    n = 305
    "Elapsed time: 33.479853 msecs"
       Result= -9.434172113697462
    ------ USING JANINO
    n = 0
    "Elapsed time: 0.032084 msecs"
       Result= -9.434172113697462
    n = 1
    "Elapsed time: 0.032022 msecs"
       Result= -9.434172113697462
    n = 2
    "Elapsed time: 0.029989 msecs"
       Result= -9.434172113697462
    n = 3
    "Elapsed time: 0.034251 msecs"
       Result= -9.434172113697462
    n = 4
    "Elapsed time: 0.030606 msecs"
       Result= -9.434172113697462
    n = 5
    "Elapsed time: 0.030186 msecs"
       Result= -9.434172113697462
    n = 8
    "Elapsed time: 0.032132 msecs"
       Result= -9.434172113697462
    n = 11
    "Elapsed time: 0.030109 msecs"
       Result= -9.434172113697462
    n = 15
    "Elapsed time: 0.031009 msecs"
       Result= -9.434172113697462
    n = 21
    "Elapsed time: 0.032625 msecs"
       Result= -9.434172113697462
    n = 29
    "Elapsed time: 0.031489 msecs"
       Result= -9.434172113697462
    n = 40
    "Elapsed time: 0.030665 msecs"
       Result= -9.434172113697462
    n = 57
    "Elapsed time: 0.03146 msecs"
       Result= -9.434172113697462
    n = 79
    "Elapsed time: 0.031599 msecs"
       Result= -9.434172113697462
    n = 111
    "Elapsed time: 0.029998 msecs"
       Result= -9.434172113697462
    n = 156
    "Elapsed time: 17.579771 msecs"
       Result= -9.434172113697462
    n = 218
    "Elapsed time: 24.561065 msecs"
       Result= -9.434172113697462
    n = 305
    "Elapsed time: 33.357928 msecs"
       Result= -9.434172113697462
    

    From the above output, it appears that javac and Janino produce about equally performant code, and that for low values of n the value doesn't seem to matter. However, at n=156 we observe a dramatic increase in runtime. I don't know for sure why that is; I suspect it is related to per-method limits on the JVM, which force the Java compiler (javac/Janino) to use workarounds that are harder for the JIT to optimize. One plausible culprit is HotSpot's huge-method limit: by default the JIT refuses to compile methods larger than roughly 8000 bytes of bytecode (the -XX:-DontCompileHugeMethods flag disables this), so very large generated methods may simply run interpreted. But maybe someone can shed more light on that.

    Question 2: Does redundantly adding 0 affect performance?

    I wrote a class to experiment with this. The class has two static methods that both do exactly the same computation, except that in apply0 we also add 0 when computing the array indices:

    public class Mul2d {
        public static double[] apply0(double angle, double[] X) {
            int n = X.length/2;
            double[] Y = new double[2*n];
            double cosv = Math.cos(angle);
            double sinv = Math.sin(angle);
            for (int i = 0; i < n; i++) {
                int at = 2*i;
                Y[at + 0] = cosv*X[at + 0] - sinv*X[at + 1];
                Y[at + 1] = sinv*X[at + 0] + cosv*X[at + 1];
            }
            return Y;
        }
    
        public static double[] apply(double angle, double[] X) {
            int n = X.length/2;
            double[] Y = new double[2*n];
            double cosv = Math.cos(angle);
            double sinv = Math.sin(angle);
            for (int i = 0; i < n; i++) {
                int at = 2*i;
                Y[at] = cosv*X[at] - sinv*X[at + 1];
                Y[at + 1] = sinv*X[at] + cosv*X[at + 1];
            }
            return Y;
        }
    }
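
    As a sanity check, the two variants should produce bit-identical output for the same input; a quick self-contained check along these lines confirms it (Mul2dCheck is an illustrative name, with the two methods copied from above):

```java
import java.util.Arrays;

// Sanity check: the '+ 0' variant and the plain variant compute the
// exact same floating-point results, element for element.
public class Mul2dCheck {
    static double[] apply0(double angle, double[] X) { // with '+ 0'
        int n = X.length / 2;
        double[] Y = new double[2 * n];
        double cosv = Math.cos(angle), sinv = Math.sin(angle);
        for (int i = 0; i < n; i++) {
            int at = 2 * i;
            Y[at + 0] = cosv * X[at + 0] - sinv * X[at + 1];
            Y[at + 1] = sinv * X[at + 0] + cosv * X[at + 1];
        }
        return Y;
    }

    static double[] apply(double angle, double[] X) { // without '+ 0'
        int n = X.length / 2;
        double[] Y = new double[2 * n];
        double cosv = Math.cos(angle), sinv = Math.sin(angle);
        for (int i = 0; i < n; i++) {
            int at = 2 * i;
            Y[at] = cosv * X[at] - sinv * X[at + 1];
            Y[at + 1] = sinv * X[at] + cosv * X[at + 1];
        }
        return Y;
    }

    public static void main(String[] args) {
        double[] x = new double[1000];
        for (int i = 0; i < x.length; i++) x[i] = Math.sin(i);
        System.out.println(Arrays.equals(apply0(0.7, x), apply(0.7, x))); // prints "true"
    }
}
```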
    

    Running a benchmark on a large array suggests that whether you add 0 or not does not matter. Here is the output of the benchmark:

    With adding '+ 0'
    "Elapsed time: 0.247315 msecs"
    "Elapsed time: 0.235471 msecs"
    "Elapsed time: 0.240675 msecs"
    "Elapsed time: 0.251799 msecs"
    "Elapsed time: 0.267139 msecs"
    "Elapsed time: 0.250735 msecs"
    "Elapsed time: 0.251697 msecs"
    "Elapsed time: 0.238652 msecs"
    "Elapsed time: 0.24872 msecs"
    "Elapsed time: 1.274368 msecs"
    Without adding '+ 0'
    "Elapsed time: 0.239371 msecs"
    "Elapsed time: 0.233459 msecs"
    "Elapsed time: 0.228619 msecs"
    "Elapsed time: 0.389649 msecs"
    "Elapsed time: 0.238742 msecs"
    "Elapsed time: 0.23459 msecs"
    "Elapsed time: 0.23452 msecs"
    "Elapsed time: 0.241013 msecs"
    "Elapsed time: 0.356035 msecs"
    "Elapsed time: 0.260892 msecs"
    

    The runtimes appear pretty much equivalent; any differences drown in the noise.

    Conclusion: Regarding Question 1, I cannot observe any negative impact on performance in this particular toy problem.

    Regarding Question 2, whether you add the + 0 doesn't seem to matter: either the JIT optimizes it away, or the other computations in the loop dominate the total time, so any small extra cost of the + 0 drowns in the noise.