
18 Advanced Progressive JIT: How It Affects JVM Performance #

In the previous lesson, we learned about the Java Virtual Machine (JVM) stack, which is actually a two-layer stack. The first layer is the stack frame for methods, and the second layer is the operand stack for bytecode instructions.

[Figure: Diagram of the JVM stack]

Creating a stack frame consumes resources, especially for common getter and setter methods in Java, which typically only have one line of code. It would be wasteful to create a stack frame every time.

In addition, the JVM interprets bytecode instructions for code execution. Consider the following code, where variable a is declared but not used after that. If bytecode instructions were interpreted and executed, a lot of unnecessary work would need to be done.

public class A {
    int attr = 0;
    public void test() {
        int a = attr;
        System.out.println("ok");
    }
}

Here are the bytecode instructions for this code. We can see the three unnecessary bytecode instructions: aload_0, getfield, and istore_1.

public void test();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=2, locals=2, args_size=1
         0: aload_0
         1: getfield      #2                  // Field attr:I
         4: istore_1
         5: getstatic     #3                  // Field java/lang/System.out:Ljava/io/PrintStream;
         8: ldc           #4                  // String ok
        10: invokevirtual #5                  // Method java/io/PrintStream.println:(Ljava/lang/String;)V
        13: return
      LineNumberTable:
        line 4: 0
        line 5: 5
        line 6: 13

Furthermore, we learned that the garbage collector primarily targets the heap for garbage collection. The more objects created on the heap, the greater the pressure on garbage collection. If some variables can be directly allocated on the stack, the pressure on garbage collection will be reduced.

In fact, the JVM already performs these optimizations through the JIT (Just-In-Time) compiler. The main goal of the JIT compiler is to turn interpreted execution into compiled execution.

To improve the execution efficiency of hot code, the JVM compiles it at runtime into machine code for the local platform and applies various levels of optimization. This is the job of the JIT compiler.

[Figure: Hot code being JIT-compiled into machine code and cached in the Code Cache]

As shown in the figure above, the JVM compiles frequently invoked methods, as well as code that runs many times inside loops, into machine code and caches it in the Code Cache area. The next time the same method is called, the compiled code can be used directly.
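If you want to see which methods are being compiled, the standard -XX:+PrintCompilation flag prints one line per compilation. The command and output below are illustrative rather than from a specific run; MyApp is a placeholder class name:

java -XX:+PrintCompilation MyApp

A typical line shows a timestamp, a compile id, the tier, and the compiled method, roughly like:

    168   23       3       java.lang.String::indexOf (70 bytes)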

So, what techniques does the JIT compiler employ? Let’s explore them in detail.

Method Inlining #

In Lesson 05, when we talked about JMH, we learned that the CompilerControl annotation can control some behaviors of the JIT compiler.

Among them is a mode called inline, which stands for method inlining: the bodies of short methods are incorporated directly into the calling method, as if the code had been written inside that code block. This eliminates one method call, resulting in improved execution speed. That is the concept of method inlining.

You can use the -XX:-Inline parameter to disable method inlining. If you want more fine-grained control, you can use the CompileCommand parameter. For example:

-XX:CompileCommand=exclude,java/lang/String.indexOf
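Besides exclude, CompileCommand supports other commands such as inline, dontinline, and compileonly (these are standard HotSpot commands; the target method here is only an example):

-XX:CompileCommand=dontinline,java/lang/String.indexOf
-XX:CompileCommand=compileonly,java/lang/String.indexOf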

JMH uses this parameter to implement custom compilation features. In the JDK source code, there are also many methods annotated with @ForceInline, which will be forcibly inlined during execution. On the other hand, methods annotated with @DontInline will never be inlined.

Let’s take the 16th code example from Lesson 05, “Tools Practice: Benchmarking with JMH and Precisely Measuring Method Performance,” to see the effects of these JIT optimizations. The main code block is as follows:

public void target_blank() {
    // this method was intentionally left blank
}

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public void target_dontInline() {
    // this method was intentionally left blank
}

@CompilerControl(CompilerControl.Mode.INLINE)
public void target_inline() {
    // this method was intentionally left blank
}

@CompilerControl(CompilerControl.Mode.EXCLUDE)
public void target_exclude() {
    // this method was intentionally left blank
}

The execution results are as follows. We can see that the difference between JIT-compiled execution and excluded (interpreted) execution is more than 100 times, and inlined execution is roughly 4 times faster than non-inlined execution.

Benchmark                                Mode  Cnt   Score   Error  Units
JMHSample_16_CompilerControl.baseline    avgt    3   0.485 ± 1.492  ns/op
JMHSample_16_CompilerControl.blank       avgt    3   0.483 ± 1.518  ns/op
JMHSample_16_CompilerControl.dontinline  avgt    3   1.934 ± 3.112  ns/op
JMHSample_16_CompilerControl.exclude     avgt    3  57.603 ± 4.435  ns/op
JMHSample_16_CompilerControl.inline      avgt    3   0.483 ± 1.520  ns/op

JIT-compiled binary code is stored in the Code Cache area. This area has a fixed size and cannot be expanded once the JVM has started. If the Code Cache fills up, the JVM does not report an error; it simply stops compiling, so execution degrades to interpreted mode and performance drops. Meanwhile, the JIT compiler keeps trying to optimize your code, which increases CPU usage.

By using the -XX:ReservedCodeCacheSize parameter, you can specify the size of the Code Cache area. If monitoring shows that the space has reached its limit, you should increase it appropriately.
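For example, a startup command could look like the following; MyApp is a placeholder, and the standard -XX:+PrintCodeCache flag makes the JVM print Code Cache usage on exit, which helps with this kind of monitoring:

java -XX:ReservedCodeCacheSize=256m -XX:+PrintCodeCache MyApp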

Compilation Levels #

The HotSpot virtual machine includes multiple just-in-time compilers: C1, C2, and Graal. Starting from JDK 8, it adopts layered (tiered) compilation by default. You can often see the compiler threads in the thread information obtained with the jstack command.

"C2 CompilerThread0" #6 daemon prio=9 os_prio=31 cpu=830.41ms elapsed=4252.14s tid=0x00007ffaed023000 nid=0x5a03 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   No compile task

"C1 CompilerThread0" #8 daemon prio=9 os_prio=31 cpu=549.91ms elapsed=4252.14s tid=0x00007ffaed831800 nid=0x5c03 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   No compile task

Running just-in-time compilation on separate threads allows interpreted execution to continue uninterrupted. The JIT usually starts working in the background as soon as it is triggered; when compilation completes, the corresponding bytecode is replaced with the compiled code. There are two forms of JIT compilation: method compilation and loop compilation (the latter performed via on-stack replacement, OSR).

Layered compilation divides the execution state of the Java virtual machine into five levels:

  • Level 0: interpretation of the bytecode;
  • Level 1: execution of C1-compiled code without profiling;
  • Level 2: execution of C1-compiled code with profiling of method invocation counts and loop back-edge counts only;
  • Level 3: execution of C1-compiled code with full profiling;
  • Level 4: execution of C2-compiled code.

Profiling refers to data about the program's execution state collected at runtime, such as loop execution counts, method invocation counts, branch jumps, and type conversions. Profiling data is essentially intermediate statistics of this kind; the hprof tool in the JDK is one example of a profiler.

When layered compilation is not enabled, just-in-time compilation is triggered once the combined count of method invocations and loop back edges exceeds the threshold specified by the -XX:CompileThreshold parameter. When layered compilation is enabled, this parameter is ignored and a dynamic adjustment mechanism is used instead.
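For experimentation, both behaviors can be toggled with standard HotSpot flags: the first line below disables layered compilation and restores the single threshold (10000 is just an example value), and the second caps compilation at the C1 tier:

-XX:-TieredCompilation -XX:CompileThreshold=10000
-XX:TieredStopAtLevel=1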

Escape Analysis #

Now let’s focus on escape analysis, a knowledge point that is often asked in interviews.

Let’s first review the question left in the previous lesson: Besides primitive types, can we always say that objects are allocated on the heap?

The answer is no. Through escape analysis, the JVM can analyze the usage scope of a new object and decide whether to allocate it on the heap. Escape analysis is now the default behavior of the JVM and can be turned off by using the -XX:-DoEscapeAnalysis parameter.

So what kinds of objects are considered to have escaped? Let’s look at two typical cases. As shown in the code below, when an object is assigned to a member variable or a static variable, it can be used externally, and the object escapes.

public class EscapeAttr {
    Object attr;
    public void test() {
        attr = new Object();
    }
}

Now let’s look at the following code, where the object is returned through a return statement. Since the JVM cannot determine how the returned object will be used later, and external threads may access it, the object escapes.

public class EscapeReturn {
    public Object test() {
        Object obj = new Object();
        return obj;
    }
}

So what are the benefits of escape analysis?

1. Stack allocation

If an object is allocated inside a method and no reference to it ever escapes, the object may be optimized into a stack allocation. It can then be created and destroyed quickly within the stack frame, without allocating heap space, which effectively reduces the pressure on garbage collection.

2. Object separation or scalar replacement

However, object structures are usually complex, so how can objects be stored on the stack? The JIT can break an object apart and replace it entirely with small local variables, a process called scalar replacement (a scalar is a value that cannot be further divided, such as int, long, and other primitive types). In other words, after scalar replacement, the object's fields become local variables that can easily be allocated on the stack, with no changes needed elsewhere in the code.
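As a concrete sketch, consider an object that never leaves its method (the Point class and sum method below are illustrative, not code from the lesson's repository):

public class ScalarReplacementDemo {
    static class Point {
        int x;
        int y;
    }

    // p never escapes sum(), so with escape analysis enabled the JIT
    // may eliminate the allocation and replace p with two int locals
    static int sum(int x, int y) {
        Point p = new Point();
        p.x = x;
        p.y = y;
        return p.x + p.y;
    }
}

After scalar replacement, p.x and p.y behave like ordinary local variables, and no heap allocation is required.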

From the above description, we can see that not all objects or arrays are allocated on the heap. Due to the existence of JIT, if it is found that certain objects do not escape from a method, they may be optimized for stack allocation.

3. Synchronization elimination

If an object is found to be accessed by only one thread, synchronization for that object can be eliminated.

Note that this applies to synchronized, and not to Lock in JUC; Lock cannot be eliminated.

To enable synchronization elimination, the -XX:+EliminateLocks parameter needs to be added. Since this parameter relies on escape analysis, the -XX:+DoEscapeAnalysis option also needs to be enabled.

For example, in the following code, the JIT determines that the lock object can only be accessed by one thread, so the synchronization overhead can be removed.

public class SyncEliminate {
    public void test() {
        synchronized (new Object()) {
        }
    }
}

The repository also contains a JMH comparison of StringBuffer and StringBuilder, which shows that with lock elimination their throughput is not significantly different; a sketch of such a benchmark and the measured results follow.
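The repository's code is not reproduced here, but a minimal sketch of such a benchmark might look like this (the annotations are standard JMH; the method bodies are assumptions):

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@State(Scope.Thread)
public class BuilderVsBufferBenchmark {

    @Benchmark
    public String buffer() {
        // StringBuffer methods are synchronized, but the instance never
        // escapes, so lock elimination can remove the locking
        StringBuffer sb = new StringBuffer();
        sb.append("a").append("b");
        return sb.toString();
    }

    @Benchmark
    public String builder() {
        // StringBuilder is the unsynchronized counterpart
        StringBuilder sb = new StringBuilder();
        sb.append("a").append("b");
        return sb.toString();
    }
}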

Benchmark                          Mode  Cnt       Score        Error   Units
BuilderVsBufferBenchmark.buffer   thrpt   10   90085.927 ±  95174.289  ops/ms
BuilderVsBufferBenchmark.builder  thrpt   10  103280.200 ±  76172.538  ops/ms

JITWatch #

You can use the JITWatch tool to observe some of the JIT's behavior.

https://github.com/AdoptOpenJDK/jitwatch

Add the parameters below to the application's startup command to enable recording; this will generate a jitdemo.log file. (Note that -XX:+PrintAssembly additionally requires the hsdis disassembler library to be installed.)

-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading  -XX:+PrintAssembly -XX:+LogCompilation -XX:LogFile=jitdemo.log

Using JITWatch, you can open this file and see detailed compilation results.

[Screenshot: JITWatch displaying the compilation log]

Here is a piece of test code:

public class SimpleInliningTest {
    public SimpleInliningTest() {
        int sum = 0;
        // 1_000_000 is F4240 in hex
        for (int i = 0; i < 1_000_000; i++) {
            sum = this.add(sum, 99);
            // 63 hex
        }
        System.out.println("Sum:" + sum);
    }

    public int add(int a, int b) {
        return a + b;
    }

    public static void main(String[] args) {
        new SimpleInliningTest();
    }
}

From the results after execution, we can see that the hot for loop has been compiled using JIT, and the add method used inside has also been inlined.

[Screenshot: JITWatch showing the compiled hot loop with add inlined]

Summary #

JIT is the main optimization point of a modern JVM and can significantly improve program execution efficiency: going from interpreted execution to the highest C2 level can yield an order-of-magnitude performance improvement. However, just-in-time compilation itself is slow and consumes both time and space, so these optimizations happen concurrently with interpreted execution.

It is worth noting that JIT optimization may also be undone (deoptimization) in certain cases. For example, the class redefinition (RedefineClasses) triggered by some hot-deployment mechanisms invalidates the JIT compilation results, and the related inlined code must be regenerated.

JIT optimizations are not always beneficial. For example, in the following code, the stop flag is not volatile; once the loop is JIT-compiled, the read of the flag can be hoisted out of the loop, resulting in an infinite loop. However, if you start with the parameter -Djava.compiler=NONE to disable JIT, the program can run to completion.

public class Demo {
    static final class TestThread extends Thread {
        boolean stop = false;
        public boolean isStop() {
            return stop;
        }
        @Override
        public void run() {
            try {
                Thread.sleep(100);
            } catch (Exception ex) {
                ex.printStackTrace();
            }
            stop = true;
            System.out.println("END");
        }
    }

    public static void main(String[] args) {
        int i = 0;
        TestThread test = new TestThread();
        test.start();
        while (!test.isStop()) {
            System.out.println("--");
            i++;
        }
    }
}
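The root cause is a visibility problem: because stop is not volatile, the compiled loop is free to cache the value it read. The usual fix, which works regardless of JIT settings, is:

volatile boolean stop = false; // volatile prevents the read from being hoisted out of the loop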

We mainly looked at the concepts of method inlining and escape analysis, and learned that after some methods are optimized, objects are not necessarily allocated on the heap. After scalar replacement, they may be directly allocated on the stack. These knowledge points are often asked in interviews.

JIT optimizations are usually performed silently in the background, and we rarely need to pay attention to them. When the Code Cache reaches its capacity limit, it affects program execution efficiency, but unless you have a very large amount of code, the default 240 MB is usually sufficient.