What's new in the Performance JDK? Project Skogsluft, FFM vs Unsafe, Benchmark JITs - JVM Weekly vol. 71
Today's edition is strictly thematic, as we will be looking at various Java performance initiatives from different angles.
1. Project Skogsluft promises big changes to Java Flight Recorder.
With the imminent arrival of the Foreign Function and Memory API, which lands in JDK 22 as a stable feature as early as March, the use of 'native' code is likely to become more widespread. The topic keeps growing in popularity and appears in one form or another in almost every edition of this review, including today's. Consequently, the whole Java "toolbox" needs to be prepared for the expected surge in usage, so it's no wonder there is a proposal for a new Project exploring application performance.
Project Skogsluft, proposed by Jaroslav Bachorík from DataDog, aims to significantly enhance Java Flight Recorder (JFR) profiling by bridging the gap between profiling Java and native code execution. Skogsluft focuses on three main enhancements: a unified stack-walking mechanism that also covers native code, a flexible CPU sampling schedule tailored to the capabilities of different operating systems, and extended support for thread labeling in JFR. Together, these aim to give developers more flexible profiling options and richer context for analysis.
The project is now seeking support from the OpenJDK developers. Development is proposed to take place in a separate fork (starting from JDK 23), with new features gradually delivered to the mainline through a series of JEPs.
Why the name “Skogsluft”? It comes from Norwegian, where “skog” means “forest” and “luft” means “air”; combined, “Skogsluft” translates to “forest air” or “forest breeze”. I find it a fitting name, considering the fresh perspective the project could bring to profiling in JFR. However, having previously worked for a Norwegian company, I've learned that with their language, surprises are always possible.
While we're on JFR, I'll also toss you the text Java Flight Recorder on Kubernetes. Piotr Minkowski's article is a comprehensive guide to using Java Flight Recorder (JFR) on Kubernetes with Cryostat for continuous monitoring of Java applications. Cryostat enables remote management of JFR for applications running in containers, making it easier to monitor and diagnose their performance. The article walks through the full process: installing Cryostat with an operator (a Kubernetes component that automates the deployment, scaling, and management of applications and services, making complex, stateful workloads easier to run) or a Helm chart, and building a sample Spring Boot application that emits custom JFR events. The post also explains how to manage custom event templates and analyze the data in JDK Mission Control.
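Custom JFR events of the kind emitted by such a sample application are plain classes extending `jdk.jfr.Event`. A minimal sketch, assuming a made-up event name and fields (these are illustrative, not taken from the article):

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// Hypothetical custom JFR event; the name and fields are invented
// for illustration, not taken from Piotr Minkowski's sample app.
@Name("demo.OrderProcessed")
@Label("Order Processed")
class OrderProcessedEvent extends Event {
    @Label("Order Id")
    long orderId;

    @Label("Processing Time (ms)")
    long processingTimeMs;
}

public class JfrDemo {
    public static void main(String[] args) {
        OrderProcessedEvent event = new OrderProcessedEvent();
        event.begin();
        // ... business logic being measured would run here ...
        event.orderId = 42L;
        event.processingTimeMs = 7L;
        event.end();
        event.commit(); // recorded only if a JFR recording is active
        System.out.println("event committed");
    }
}
```

Once committed, the event shows up in recordings alongside the built-in JDK events and can be inspected in JDK Mission Control like any other.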
And one last throw-in: Gunnar Morling, recently known for the 1BRC also described in this newsletter, once published a specification of the hitherto undocumented JDK Flight Recorder output file format. Such a fascinating crossover, I couldn't let it go.
2. It's benchmarking time! FMA vs Unsafe
We started with profiling, but that's not the end of the broader performance topics. As promised earlier, we return to the topic of the Foreign Memory Access API.
There's been quite the buzz in the Java community about the recent proposal to phase out certain methods of the Unsafe class, which has long been the cornerstone of memory access and allocation outside the standard Java memory model. The change is significant because it affects the speed and flexibility of memory operations, crucial for high-performance libraries like Netty, Spark, Avro, and Kryo, which now face the challenge of finding new ways to meet their performance goals. However, Tomer Zeltzer, Senior Software Engineer at Yahoo, points out in his article Java's New FMA: Renaissance Or Decay? that despite the initial shock, there are alternatives to Unsafe that show promise.
That's because the article benchmarks the new Foreign Memory Access (FMA) API, developed as part of Project Panama and intended to replace Unsafe with a safer, officially supported solution. FMA looks promising, offering comparable read speeds, 42% faster heap writes than Unsafe, and almost three times faster off-heap memory accesses. The new API uses an Arena object to allocate memory, and the memory segments it hands out cannot be freed individually, which makes memory leaks easier to avoid. Although memory release is thus less granular than with Unsafe, the new API appears to retain the same capabilities while offering a slightly more convenient interface.
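The arena-based allocation model can be sketched in a few lines. This is a minimal example assuming JDK 22, where the FFM API became final (the types used are standard `java.lang.foreign` API):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Minimal sketch of off-heap allocation via the FFM API (JDK 22+),
// roughly the role sun.misc.Unsafe.allocateMemory used to play.
public class ArenaDemo {
    public static void main(String[] args) {
        // A confined arena: every segment allocated from it is freed
        // together when the arena is closed -- no per-segment free().
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(ValueLayout.JAVA_LONG, 8);
            for (int i = 0; i < 8; i++) {
                segment.setAtIndex(ValueLayout.JAVA_LONG, i, (long) i * i);
            }
            long v = segment.getAtIndex(ValueLayout.JAVA_LONG, 3);
            System.out.println(v); // 9
        } // all memory from the arena is released here, at once
    }
}
```

The try-with-resources block is exactly the "less granular but leak-resistant" release model mentioned above: you cannot forget to free one segment, because segments only die with their arena.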
Here's a small disclaimer: the original article was somewhat longer, but after a discussion on Reddit - involving Ron Pressler himself - some parts were edited out. For the record, you can find the original version here, but I personally recommend just reading the corrected, latest version.
If the new API has piqued your interest and you're curious about its internals, a great opportunity is coming your way. At the beginning of the week, a video appeared on the Java channel, Foreign Function & Memory API - A (Quick) Peek Under the Hood, which I have already had the chance to watch and can really recommend. From it you will learn not only what the Foreign Function and Memory API does (that is probably already more or less familiar) but also the assumptions behind the design of the API itself: a "Java-first" approach to calling native functions and managing memory segments that sidesteps the limitations of JNI and the Direct Buffer API, especially in numerically intensive areas like machine learning.
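To give a taste of that "Java-first" approach to native calls, here is a minimal downcall sketch binding the C standard library's strlen through the FFM Linker. It assumes JDK 22 (where these APIs are final); no JNI glue code is involved:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

public class StrlenDemo {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        // Look up strlen in the default (C library) lookup and bind it
        // to a MethodHandle with signature (address) -> long.
        MethodHandle strlen = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
        try (Arena arena = Arena.ofConfined()) {
            // Copy a Java String into native memory as a NUL-terminated C string.
            MemorySegment cString = arena.allocateFrom("forest air");
            long len = (long) strlen.invokeExact(cString);
            System.out.println(len); // 10
        }
    }
}
```

Compare this with the equivalent JNI workflow (a native stub, a generated header, a compiled shared library): the entire binding lives in plain Java, which is precisely the point the video makes.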
3. The Great JIT Compiler Comparison
Although the comparison of FMA and Unsafe is an interesting topic, it pales somewhat next to the monumental scale of another benchmark. The text JVM Performance Comparison for JDK 21 by Ionut Balosin and Florin Blanaru provides a detailed comparison of the performance of various Just-In-Time (JIT) compilers, with a particular focus on JDK 21. The whole thing is over an hour (!) of reading (and that's if you're a fast reader), and the benchmarks are divided into categories covering a wide range of scenarios, from low-level compiler optimisations to high-level Java API usage and classic programming problems of the kind you'd expect in FAANG interviews or Advent of Code.
The compilers evaluated were the C2 (Server) JIT from OpenJDK 21 and two versions of the Graal JIT compiler (from GraalVM CE 21+35.1 and Oracle GraalVM 21+35.1), tested on both x86_64 and arm64 architectures. Benchmarks were run with the Java Microbenchmark Harness (JMH) version 1.37 on a MacBook Pro with an M1 chip (admittedly an interesting, unusual choice) and a Dell XPS 15 with an Intel Core i7-9750H, under controlled conditions to minimise performance variability - it looks like the authors put a lot of thought into preparing the benchmark cases.
Overall, the Oracle GraalVM JIT compiler proved the fastest of those tested, with a significant performance advantage over the C2 JIT compiler: 23% on x86_64 and 17% on arm64. These gains come from optimisations inherited from the former Enterprise Edition, such as improved partial escape analysis and more aggressive inlining strategies.
Interestingly, while the C2 JIT and GraalVM CE JIT compilers scored similarly on average, they differed significantly in specific capabilities. C2 offers advanced support for intrinsics (built-in functions the JIT handles specially, often mapping them directly to processor instructions) and vectorisation (processing data in blocks instead of one element at a time), and better exception handling than GraalVM CE JIT. On the other hand, it has limitations in its inlining heuristics (replacing a function call with the function's body) and in the devirtualisation (turning virtual method calls into direct ones) of complex call sites, and in rare cases it may fail to compile a method, falling back to less optimal execution paths (such as stopping at C1-compiled or even interpreted code).
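To make the devirtualisation point concrete, here is an illustrative sketch (my own, not from the benchmark suite) of the kind of call site it targets. When the JIT only ever observes one receiver type at a virtual call, it can replace the virtual dispatch with a direct, inlinable call:

```java
// Illustrative sketch of a devirtualisation candidate; class names
// are invented for this example, not taken from the benchmarks.
interface Shape { double area(); }

record Square(double side) implements Shape {
    public double area() { return side * side; }
}

record Circle(double r) implements Shape {
    public double area() { return Math.PI * r * r; }
}

public class DevirtDemo {
    // If callers only ever pass Squares, this call site stays
    // monomorphic and the JIT can devirtualise and inline area().
    // Mixing in Circles (and more types) makes it polymorphic or
    // megamorphic, which defeats that optimisation.
    static double totalArea(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            sum += s.area(); // virtual call -- devirtualisation candidate
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] squares = { new Square(2), new Square(3) };
        System.out.println(totalArea(squares)); // 13.0
    }
}
```

How aggressively a compiler handles such sites (profile-guided speculation, guards, deoptimisation on a new type) is exactly where C2 and the Graal compilers diverge in the article's results.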
As I mentioned, the text is very comprehensive, so if you are curious (and patient), check out the individual results. I will probably return to it more than once; it will serve me as a detailed (really detailed) source on the strengths and weaknesses of the various JIT compilers. Of course, as usual with this kind of publication, the article ends with a reminder that while the Oracle GraalVM JIT compiler does indeed lead in performance, the choice of JVM distribution should not be based solely on micro-benchmark results. Still, there's no hiding it: in 2024, GraalVM is on a really good run.