Dynamic Compilation in the JVM

Thu 14 February 2019

One of the most crucial parts of the JVM is the Just-In-Time (JIT) compiler; understanding how JVM dynamic compilation and optimization work is key to getting a sense of how to measure and improve the performance of JVM applications by writing JIT-friendly code. Having a mental model of how code executes can help us make informed decisions. The other key point is that microbenchmarking is hard to get right in general, but it is even harder to get right on the JVM due to the adaptive nature of the compiler.

  1. JVM code execution in a nutshell
  2. Perils of microbenchmarking
  3. The case for measuring JVM applications

JVM code execution in a nutshell

This should provide motivation for why it is even more important to measure and identify hotspots in your application code. The runtime execution of Java code is far more complex than that of native C/C++ applications; for example, the gcc compiler compiles and optimizes C code ahead of time (AOT), which means we can use a disassembler on the binaries to see exactly the machine code that will run on the hardware. The JVM, on the other hand, does not do this, for several key reasons: bytecode must stay platform independent, classes can be loaded dynamically at any point, and the most profitable optimizations depend on how the application actually behaves at runtime.

Given the constraints above, the JVM makes a completely different set of tradeoffs to enable a very adaptive system. Code compilation and optimization are performed at runtime, continuously, and are highly dependent on the workload of the application. So unlike a statically compiled language like C/C++, we don't know the exact code executed on the CPU until runtime. A high-level overview of how the JVM executes code is outlined below.

The javac compiler simply transforms Java source code into Java bytecode, but this isn't machine code which a CPU can execute; it is a machine-independent intermediate representation. In this process javac performs only minimal and elementary optimization (this is by design: as we shall see, the most powerful optimizations are done by the JIT).

package net.isaiahp;

public class StringConcat {

    public static void main(String[] args) {
        String str = "is this";
        str += "inefficient ?";

        System.out.print(str);

    }

}

Using the HotSpot Java 8 javac compiler, the code above is transformed into the following bytecode, shown here via javap -c StringConcat.class:

public class net.isaiahp.StringConcat {
  public net.isaiahp.StringConcat();
    Code:
       0: aload_0
       1: invokespecial #1                  // Method java/lang/Object."<init>":()V
       4: return

  public static void main(java.lang.String[]);
    Code:
       0: ldc           #2                  // String is this
       2: astore_1
       3: new           #3                  // class java/lang/StringBuilder
       6: dup
       7: invokespecial #4                  // Method java/lang/StringBuilder."<init>":()V
      10: aload_1
      11: invokevirtual #5                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      14: ldc           #6                  // String inefficient ?
      16: invokevirtual #5                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      19: invokevirtual #7                  // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
      22: astore_1
      23: getstatic     #8                  // Field java/lang/System.out:Ljava/io/PrintStream;
      26: aload_1
      27: invokevirtual #9                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V
      30: return
}

Notice at offset 3 of the generated bytecode that a StringBuilder is allocated (and initialized at offset 7): javac has rewritten the string concatenation into StringBuilder calls.

However, the bytecode produced by javac cannot be executed on a real CPU; it is the JVM's responsibility to turn bytecode into native code for the CPU to execute. This can be done in two ways:

  - The JVM can interpret and emulate each bytecode instruction. This approach is simple to implement but suffers from slower execution, as native code can be orders of magnitude faster.
  - The JVM can compile each method to native code for the particular platform and then run the machine code directly on the CPU. The tradeoff here is that code generation at runtime can be expensive and can add too much overhead, especially for cold methods (i.e. if a method is only executed once, any benefit in execution speed is outweighed by the time taken to compile it).

Bytecode cannot be run on a physical CPU, so how does the HotSpot JVM actually execute code?

As we shall see, most JVMs use a combination of both of the above approaches.

The diagram below illustrates the code-execution lifecycle on modern JVMs.

[diagram: jvm-execution-life-cycle]

  1. Bytecode interpreter: All code execution is first done through the bytecode interpreter. Once the .class files are loaded by the class loader, the methods to be executed are stored in a reserved area of memory called the Method Cache, which contains verified bytecode for each method. Only hot methods deemed worthy of compilation to native code are actually compiled.

    Early versions of the JVM executed code only in interpreted mode; you can still run an application in interpreted-only mode with the -Xint JVM argument.

  2. Profiling: One of the main functions of the JVM is to determine which methods to compile at runtime. To determine the hotness of a method, an invocation counter is associated with each method and is incremented each time the interpreter executes the method. Background threads in the JVM observe and gather statistics on which methods, and which parts of each method's bytecode, from the Method Cache are most commonly run. The information collected about the execution of the app is called a profile. The compiler can then use this profile to make some of its optimizations more effective, which can result in significant performance benefits; this technique is often referred to as profile-guided optimization.
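
    You can watch the outcome of this profiling as it happens: the HotSpot flag -XX:+PrintCompilation logs every method as it gets compiled, and -XX:CompileThreshold tunes the invocation count at which compilation is triggered (it applies when tiered compilation is disabled). A sketch, where MyApp is an illustrative placeholder for your main class:

    java -XX:+PrintCompilation MyApp
    java -XX:-TieredCompilation -XX:CompileThreshold=5000 MyApp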

  3. JIT compilation (native code generation): Once a hot method has been identified by the optimizer, the method is compiled to highly optimized machine code. This machine code is stored in a region of memory called the Code Cache. Once machine code for the method is in the code cache, the JVM patches the method dispatch table to point to the machine code, so subsequent calls to the method will execute the machine code directly on the CPU.

    The current OpenJDK implementations have two JIT compilers:

    The client compiler (C1) can produce machine code quickly, though without aggressive optimization.

    The server compiler (C2) takes a bit longer to produce machine code but can generate highly optimized machine code.

    Previously the C1 compiler was used for desktop applications and C2 for server applications; from JDK 8 onwards the default behaviour is to combine both the C1 and C2 compilers. Initially the bytecode for a method is compiled with C1, and if the method is still identified as a hotspot and the JVM has collected enough data to indicate that a C2 compilation is worthwhile, the method is compiled a second time with further optimizations. This is usually referred to as Tiered Compilation and is turned on by default in all JVMs from version 8 onwards.

    -XX:+TieredCompilation / -XX:-TieredCompilation

    can be used to enable or disable this feature.
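
    Tiered compilation internally uses compilation levels 0 (interpreter) up to 4 (C2). To see the effect of stopping at the quick C1 tier, the -XX:TieredStopAtLevel flag caps the level, and -XX:ReservedCodeCacheSize sizes the code cache; the values below are illustrative:

    java -XX:TieredStopAtLevel=1 MyApp
    java -XX:ReservedCodeCacheSize=256m MyApp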

  4. Deoptimization: It turns out the JVM must also be able to deoptimize code (i.e. discard native code for a method and run the method in the interpreter again).

    Why the need to deoptimize code?

    One of the main advantages of a managed, profile-guided optimizer is the ability to make speculative optimizations: the VM can make educated guesses and perform aggressive optimizations. However, for this to work there needs to be a way for the JVM to correct its assumptions in case they are no longer true. Deoptimization is that fallback strategy: JIT-compiled code is invalidated and execution falls back to the interpreter until further data is collected and another compilation is done.

    Inlining virtual method dispatches is not generally possible in static AOT compilers.

    All methods are virtual by default in Java, but the JIT compiler can remove the overhead of a virtual method dispatch and turn it into a static call (a fixed-address call, eliminating the indirection of the vtable lookup), or even inline the method at the call site, if there is enough runtime data to show that most likely only a single type in the hierarchy is ever used at that call site (a sketch of this follows this list).

    However, if at a later point a new type from the class hierarchy is loaded or appears at the call site, the JIT will have to deoptimize: it backs out the optimization by invalidating the machine code in the Code Cache, and execution reverts to running bytecode through the interpreter, which performs the full virtual dispatch until the next JIT compilation. This means the generated machine code for a method can be invalidated by changes to the running program and must be thrown away and potentially regenerated.
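
To make the devirtualization point concrete, here is a minimal sketch (the Shape, Circle, and Square names are illustrative, not from the JDK) of how the set of types observed at a call site changes what the JIT can do:

interface Shape {
    double area();
}

class Circle implements Shape {
    private final double r;
    Circle(double r) { this.r = r; }
    @Override
    public double area() { return Math.PI * r * r; }
}

class Square implements Shape {
    private final double side;
    Square(double side) { this.side = side; }
    @Override
    public double area() { return side * side; }
}

public class CallSiteDemo {
    // While only Circle instances flow through this loop, the call site is
    // monomorphic: HotSpot can guess the receiver type, replace the virtual
    // dispatch with a direct call, and inline area() into the loop body.
    static double total(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            sum += s.area(); // virtual call site the JIT may devirtualize
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] shapes = new Shape[1_000];
        for (int i = 0; i < shapes.length; i++) shapes[i] = new Circle(i);
        for (int i = 0; i < 10_000; i++) total(shapes); // warm up, let the JIT compile

        // The moment a Square reaches the same call site, the speculative
        // "receiver is always Circle" assumption is broken: the compiled code
        // for total() is deoptimized and later recompiled with a type guard.
        shapes[0] = new Square(2);
        total(shapes);
    }
}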

JIT Compiler(s) in Action

To demonstrate that Java does not just run one way, and to see the JIT compiler in action, we execute the simple program shown below. The program initialises an int array of ten thousand random integers and provides two implementations of java.util.function.IntPredicate: class IsEven and class IsOdd. As the names suggest, one implementation tests whether an int is odd, the other whether it is even. We iterate through the array executing the predicate and measure the time taken to test all items of the array; we repeat this 100 times and observe how the execution times vary and how the performance changes as the application continues to run. We do this first using the IsOdd predicate and then using the IsEven predicate.

package net.isaiahp;

import java.util.Random;
import java.util.function.IntPredicate;

public class Main {

    private static final int[] randomValues;

    public static final int MAX_SIZE = 10_000;

    static {

        Random random = new Random();
        randomValues = new int[MAX_SIZE];
        for (int i = 0; i < randomValues.length; i++) {
            randomValues[i] = random.nextInt();
        }

    }

    private static class IsEven implements IntPredicate {
        @Override
        public boolean test(int value) {
            return (value & 0x1) == 0;
        }
    }

    private static class IsOdd implements IntPredicate  {
        @Override
        public boolean test(int value) {
            return (value & 0x1) == 1;
        }
    }

    public static void main(String[] args) {

        final int ITERATIONS = 100;
        runIterations(ITERATIONS, new IsOdd(), 0);

        runIterations(ITERATIONS, new IsEven(), 100);

    }

    private static void runIterations(int count, IntPredicate predicate, int start) {
        for (int i = start; i < (start + count); i++) {
            long time = testValues(randomValues, predicate);
            System.out.println("#" + i + ", " + time);
        }
    }

    private static long testValues(int[] values, IntPredicate predicate) {

        long start = System.nanoTime();
        for (int i = 0; i < values.length; i++) {
            boolean unused = predicate.test(values[i]); // result is discarded; a JIT can in principle eliminate work with no observable effect
        }
        long end = System.nanoTime();
        return (end - start);
    }
}

To run the program above we use OpenJDK 11:

java -version
openjdk version "11.0.2" 2019-01-15
OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)

The per-iteration timings (iteration number, then nanoseconds taken to test all 10,000 values) begin like this:

#0, 769734
#1, 330086
#2, 326234
#3, 327919
#4, 324036
#5, 328149
#6, 343698
#7, 339803
#8, 227353
#9, 122103
#10, 80405
#11, 88627
#12, 110416
#13, 117678
#14, 127212
#15, 136834
#16, 132387
#17, 131895
#18, 130024

Generally the server compiler makes aggressive inlining decisions of non-final methods. As long as the inlined method is never overridden, the code is correct. When a subclass is loaded and the method overridden, the compiled code is broken for all future calls to it. The code gets declared "not entrant" (no future callers may enter the broken code), but sometimes existing callers can keep using it. In the case of inlining, that's not good enough: existing callers' stack frames are "deoptimized" when they return to the code from nested calls (or simply if they are running in the code). When no more stack frames hold PCs into the broken code, it is declared a "zombie", ready for removal once the GC gets around to it.

We execute the program and plot the execution times to help us visualize the performance.

[plot: run-time-plot, per-iteration execution times]

The graph shows the different stages of code execution. We see the run times drop significantly once the code is compiled to native code by the C2 compiler. We can also see a spike in run time as code is deoptimized once a different implementation of the IntPredicate interface is loaded and run.
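
You can confirm the deoptimization directly: rerunning the demo with the -XX:+PrintCompilation flag mentioned earlier logs each compilation event, and around iteration #100, when IsEven first reaches the call site, HotSpot marks the previously compiled code "made not entrant" before recompiling it.

java -XX:+PrintCompilation net.isaiahp.Main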

Practical guidelines

Measure, don't guess

As we have seen, the number of transformations our code goes through at runtime really obscures the relationship between the code we write and the code that actually executes on the machine. The JVM does a good job of detecting patterns in very large software at runtime and applying optimizations dynamically, which means the code we write will look very different from the machine code that runs on the CPU; any micro-optimizations we perform as developers may have no effect, or even a negative effect, on runtime performance. For example, optimizing a method which is never going to be JIT compiled is wasted effort; identifying hot code should always be the first step.
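
When a micro-level measurement is genuinely needed, a harness such as JMH handles the warmup, forking, and dead-code elimination pitfalls that the adaptive compiler introduces and that our hand-rolled timing loop above ignores. A minimal sketch, assuming the org.openjdk.jmh:jmh-core dependency and its annotation processor are on the classpath (the class and method names here are illustrative):

package net.isaiahp;

import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class IsEvenBenchmark {

    private int[] values;

    @Setup
    public void setup() {
        Random random = new Random(42); // fixed seed for repeatability
        values = new int[10_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = random.nextInt();
        }
    }

    @Benchmark
    public void testAllValues(Blackhole bh) {
        for (int value : values) {
            // Blackhole.consume stops the JIT from treating the result as
            // dead code, unlike the unused boolean in the demo above.
            bh.consume((value & 0x1) == 0);
        }
    }
}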

Does that mean we should never care about performance?

No, not at all. What it means is that rather than hand-rolling micro-optimizations, we should focus on writing JIT-friendly code so the JVM can do its optimizations better. The other area to focus on is the things the JVM can't help with: optimizations at the domain level, for example:

  - replace a linear search with an efficient O(log n) algorithm (a sketch follows this list)
  - improve how our application makes use of OS resources such as threads
  - implement caching if needed
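
As a sketch of the first point, assuming the data can be kept sorted (the class and method names here are illustrative):

import java.util.Arrays;

public class SearchDemo {

    // O(n): inspects every element in the worst case.
    static boolean containsLinear(int[] values, int target) {
        for (int v : values) {
            if (v == target) return true;
        }
        return false;
    }

    // O(log n): requires the array to be sorted first.
    static boolean containsBinary(int[] sortedValues, int target) {
        return Arrays.binarySearch(sortedValues, target) >= 0;
    }

    public static void main(String[] args) {
        int[] values = {42, 7, 19, 3, 88};
        Arrays.sort(values); // one-off O(n log n) cost, then cheap lookups
        System.out.println(containsLinear(values, 19)); // true
        System.out.println(containsBinary(values, 19)); // true
    }
}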

To identify bottlenecks it is important to collect data and measure the performance of the running application. A profiler is a tool every developer should use before attempting any performance improvement. The following quote from Knuth sums it up nicely:

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail."

--Donald Knuth