The TornadoVM Programming Model Explained with Juan Fumero - JVM Weekly vol. 155

Today, Juan Fumero explains how TornadoVM works under the hood.

Dec 04, 2025

A year ago I announced that JVM Weekly had joined the Friends of OpenJDK (Foojay.io) family. Foojay.io is a dynamic, community-driven platform for OpenJDK users, primarily Java and Kotlin enthusiasts. As a hub for the “Friends of OpenJDK,” Foojay.io gathers a rich collection of articles written by industry experts and active community members, offering valuable insights into the latest trends, tools, and practices within the OpenJDK ecosystem.

Today we’re doing something slightly unusual in JVM Weekly – the article I’m sharing with you is not from this month. In fact, it’s almost a year and a half old. Why the time travel? Recently, in one of the Foojay podcast episodes (which I also reshare a bit later), Frank Delporte did a great tour of JVM-adjacent projects, including TornadoVM. That felt like the perfect excuse to come back to one of my all-time favourite pieces ever published on Foojay and ask the author whether he’d let me repost it in JVM Weekly. The answer was a “yes” ❤️.

So today you’re getting a proper engineering deep dive: Juan Fumero’s article on the architecture of TornadoVM. Back when the text was written, Juan was still at the University of Manchester - today he’s moved to Oracle and the Java Platform Group, and he is currently working on the Heterogeneous Accelerator Toolkit (HAT) within the OpenJDK project Babylon itself, using the new code-reflection APIs for GPU code generation from Java.

Important: The original article was published in May 2024, and TornadoVM and its ecosystem have evolved a bit since then. The core concepts in the article are still absolutely spot-on, though, and they do a fantastic job explaining why TornadoVM is one of the most interesting things happening around the JVM right now.

I genuinely hope you enjoy this piece as much as I did and that you share a bit of the excitement I felt while arranging this repost 🤩

The TornadoVM Programming Model Explained

Key Takeaways

TornadoVM offers an API for parallel programming on modern hardware that tackles data parallel, task parallel and pipeline parallel applications.
TornadoVM offers different abstractions to developers to be able to express parallel applications in Java, identify the methods to offload, and dispatch the application on the corresponding accelerators.
Task-Graphs and Execution Plans are the main building blocks of TornadoVM applications, allowing developers to compose complex graphs of computations and interact with the TornadoVM runtime to enable/disable profiling, enable debugging, or enable dynamic reconfiguration to select the best possible accelerators for the compute graphs.

In this blog post, I will explain how developers can start programming with TornadoVM and interact with the TornadoVM runtime.

I will explain the TornadoVM programming model and I will show an example from scratch that illustrates all the steps to be done to run on GPUs (or any other TornadoVM-compatible hardware).

Overview of the TornadoVM Software Stack

Let’s start with a general overview of the TornadoVM Software stack and the main components, as shown in the following Figure.

At a high level, TornadoVM exposes an API for developers.

This API contains the building blocks developers use to express parallel applications, identify which methods to offload, and run applications on GPUs and FPGAs.

This API is the main content of this blog post.

Under the hood, the TornadoVM runtime system and the Just-In-Time compiler optimise, compile and dispatch the input application on heterogeneous hardware.

We will not go into the details of the runtime system in this tutorial, but in a nutshell, the TornadoVM JIT compiler extends the GraalVM JIT compiler to offload Java code to low-level GPU-friendly code, such as CUDA, OpenCL and SPIR-V.

Then, the TornadoVM runtime takes care of data migration, data handling and execution of the application for the target hardware.

Thus, in a way, TornadoVM is a full-package solution that is not only used for programming on modern hardware but also for orchestrating, running, and optimising a subset of Java programs on heterogeneous hardware.

How do we start programming with TornadoVM?

So, let’s focus now on the API level and how developers can start using TornadoVM to program their applications. To understand the main ideas behind each API component in TornadoVM, we need to think about the following aspects:

How do we represent parallelism in a programming language that was not primarily designed for parallelism and modern hardware? Note that there are different types of parallelism, such as data parallelism, task parallelism and pipeline parallelism, and we would like to run a subset of Java programs on explicitly parallel hardware, such as GPUs. Thus, ideally, we would like an API that expresses these types of parallelism in our programs in an easy manner, and that lets us dispatch our parallel programs on a wide variety of modern hardware, such as GPUs, FPGAs, RISC-V accelerators, etc.
How do we identify which functions (or methods) to offload? We usually have large programs with hundreds of classes and thousands of methods, but how do we select the methods to be offloaded (to be transformed and migrated to the target accelerator)?
How do we run on parallel hardware? And how do we profile, get the results, etc?

The TornadoVM API tries to tackle all these questions with different API components. Let’s discuss this briefly one by one.

1. Representing Parallelism

There are two ways to express parallelism with TornadoVM, and developers can choose one or the other:

Via annotations in the source code: Java for-loops are annotated with the @Parallel annotation, and reduction parameters with the @Reduce annotation, for those loops that are parallelisable. This means that, if the loop does not have data dependencies, we can add the @Parallel annotation to indicate to the TornadoVM compiler and runtime system that the loop/s can be parallelised, and, therefore, the JIT compiler can perform the corresponding code transformations to convert sequential Java loops into explicit parallel loops. It is also possible to annotate nested loops. This API is convenient for non-GPU/FPGA experts to get easier and quicker access to hardware accelerators.
Via an explicit parallel kernel API: Developers use the KernelContext API. This second style is a lower-level API compared to the annotations, and it is more similar to OpenCL and oneAPI to program explicit parallel kernels. The use of the parallel kernel API is sometimes more convenient for developers who are already familiar with OpenCL, oneAPI or CUDA and want to port existing kernels into TornadoVM.

In this post, we will focus on the first option, using the TornadoVM annotations for the for-loops. Let’s see this in practice through an example.

Let’s say we want to initialise an array of floating point numbers (in fp32) and perform some computations with this array.

Thus, let’s code two methods to 1) initialise an array; and 2) perform a computation (e.g., compute the sqrt function from the Math library):

public class MySample {
    public static void init(FloatArray array) {
        for (int i = 0; i < array.getSize(); i++) {
            array.set(i, i * 2);
        }
    }

    public static void computeSqrt(FloatArray array) {
        for (int i = 0; i < array.getSize(); i++) {
            float value = array.get(i);
            // Math.sqrt returns a double, so narrow the result back to float
            array.set(i, (float) Math.sqrt(value));
        }
    }

    public FloatArray compute(FloatArray array) {
        init(array);
        computeSqrt(array);
        // do something else
        // ...  
        return array;
    }
}

A few things to highlight regarding this code snippet:

We see a data type called FloatArray. This data type is provided by the TornadoVM API, and it contains (as the name suggests) an array of floating point numbers (fp32). The array is stored off-heap using the Java Panama Memory API. For this tutorial, we are going to stay with our FloatArray, but feel free to scan the API and Collections of TornadoVM to see all the supported types.
We see that each method returns void, and the inputs and outputs are passed as arguments to the methods. This is intentional, since TornadoVM will offload each of the Java methods to run in parallel on the target device (e.g., a GPU). Since the target hardware of TornadoVM allows developers to run many threads (usually thousands), it would be almost impossible to determine efficiently which thread/s return a value from the entire method. Thus, we write the methods in a more “Tornado”-friendly way to prepare for the next step (using the annotations in the for-loops).

Let’s now introduce the @Parallel annotation for parallel loops of the methods we want to offload.

When using the annotation, it is the responsibility of the developer to include the @Parallel annotation in the loops that do not have data dependencies.
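To make “no data dependencies” concrete, here is a minimal plain-Java sketch. It does not use TornadoVM at all, and the class and method names are hypothetical. It runs two loop bodies in forward and in reverse iteration order: the loop with independent iterations gives the same result either way, while the loop with a loop-carried dependency does not, which is why a loop of the second kind must not be annotated.

```java
import java.util.Arrays;

// Plain-Java illustration only: no TornadoVM classes are used here.
final class DependencyCheck {

    // Independent iterations: element i depends only on the index i.
    static float[] scale(int n, boolean reversed) {
        float[] out = new float[n];
        for (int k = 0; k < n; k++) {
            int i = reversed ? n - 1 - k : k;
            out[i] = i * 2;
        }
        return out;
    }

    // Loop-carried dependency: element i reads element i - 1,
    // which the forward loop has already updated in an earlier iteration.
    static float[] runningSum(float[] in, boolean reversed) {
        float[] out = Arrays.copyOf(in, in.length);
        for (int k = 1; k < out.length; k++) {
            int i = reversed ? out.length - k : k;
            out[i] = out[i] + out[i - 1];
        }
        return out;
    }

    public static void main(String[] args) {
        float[] in = {1f, 2f, 3f, 4f};
        System.out.println(Arrays.equals(scale(4, false), scale(4, true))); // prints true
        System.out.println(Arrays.toString(runningSum(in, false)));         // prints [1.0, 3.0, 6.0, 10.0]
        System.out.println(Arrays.toString(runningSum(in, true)));          // prints [1.0, 3.0, 5.0, 7.0]
    }
}
```

If changing the iteration order changes the result, running the iterations concurrently can change it too.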

This annotation will indicate to the TornadoVM compiler that we want to run the whole loop in parallel using many threads.

But how many threads? The number of threads depends on the loop bound of the annotated loop. This is transparent to Java developers: the TornadoVM runtime and the JIT compiler work together to set these values.
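As an analogy for how the loop bound defines the amount of parallel work, the sketch below emulates a data-parallel loop on the CPU with a parallel stream. This is only an illustration in plain Java with hypothetical names; it assumes nothing about the code TornadoVM actually generates or about how it sizes its thread grid.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// CPU-only analogy of a data-parallel loop; this is not TornadoVM-generated code.
final class ParallelLoopAnalogy {

    // Sequential version: one thread walks over all indices.
    static float[] initSequential(int size) {
        float[] array = new float[size];
        for (int i = 0; i < size; i++) {
            array[i] = i * 2;
        }
        return array;
    }

    // Parallel version: the loop bound defines the index space, and each index
    // is processed independently, possibly by a different worker thread.
    static float[] initParallel(int size) {
        float[] array = new float[size];
        IntStream.range(0, size).parallel().forEach(i -> array[i] = i * 2);
        return array;
    }

    public static void main(String[] args) {
        float[] sequential = initSequential(1024);
        float[] parallel = initParallel(1024);
        System.out.println(Arrays.equals(sequential, parallel)); // prints true
    }
}
```

Both versions produce the same array because every index writes only its own element.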

Going back to our example, let’s add the annotations for the two methods we potentially want to offload: the init and the computeSqrt methods.

public class HelloTornado {
    public static void init(FloatArray array) {
        for (@Parallel int i = 0; i < array.getSize(); i++) {
            array.set(i, i * 2);
        }
    }

    public static void computeSqrt(FloatArray array) {
        for (@Parallel int i = 0; i < array.getSize(); i++) {
            float value = array.get(i);
            array.set(i, TornadoMath.sqrt(value));   // << Use TornadoMath class instead
        }
    }

    public FloatArray compute(FloatArray array) {
        init(array);
        computeSqrt(array);
        // do something else
        // ...  
        return array;
    }
}

Furthermore, for this step, we replace Math.sqrt with TornadoMath.sqrt. TornadoVM offers a math library, similar to Java’s. The reason for having this library is that double (fp64) types are not supported on all GPUs and accelerators, for example on Intel ARC GPUs or the latest Intel HD graphics.

However, we can still compute sqrt or many of the math functions using less precision, such as in fp32 (float in Java), or even less. To allow this integration, TornadoVM offers this API that the JIT compiler can understand and provide the correct replacements using the narrower types (e.g., fp32 or fp16).
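The precision cost of narrowing to fp32 can be checked on the CPU with plain Java. The sketch below compares a square root computed in fp64 with the same result narrowed to fp32. It does not call TornadoMath, and the class and method names are hypothetical.

```java
// Plain-Java comparison of fp32 and fp64 results; no GPU or TornadoVM involved.
final class PrecisionCheck {

    // Square root computed in double precision (fp64).
    static double sqrtFp64(double value) {
        return Math.sqrt(value);
    }

    // Same computation with the result narrowed to single precision (fp32).
    static float sqrtFp32(float value) {
        return (float) Math.sqrt(value);
    }

    // Relative difference between the fp32 result and the fp64 reference.
    static double relativeError(float value) {
        double reference = sqrtFp64(value);
        return Math.abs(sqrtFp32(value) - reference) / reference;
    }

    public static void main(String[] args) {
        System.out.println("fp32 bytes: " + Float.BYTES + ", fp64 bytes: " + Double.BYTES);
        System.out.println("relative error for sqrt(2): " + relativeError(2.0f));
    }
}
```

For sqrt(2) the relative error stays below 1e-7, in line with the roughly seven significant decimal digits fp32 offers. Whether that is enough depends on the application.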

Besides, there is another reason why you might want to use the TornadoMath library for your applications when running on GPUs, and that’s performance.

CPUs usually offer the same performance when computing fp32 and fp64 operations. However, this is not usually the case for current GPUs. For example, while you can compute operations in double (fp64) precision on NVIDIA GPUs, there are usually fewer functional units per GPU thread in fp64 compared to fp32. This means that, when operating in fp64, CUDA threads need to share the functional units. To give an example, on the RTX 4090 GPU, the ratio of fp64 to fp32 throughput is 1:64. Thus, be careful! In GPU programming, think twice before you operate using double data types.

Now, let’s move on to create our compute graphs.

2. Identifying the Java Methods to Offload

TornadoVM offloads code at the method level (similar to the JIT compiler in HotSpot). To specify which method/s to offload, TornadoVM offers a Task-Graph API, in which each node in the graph represents a task.

Besides, we add the data inputs and outputs of our computation to the graph. This is useful for the TornadoVM runtime, which needs to perform data migration between the host (main CPU) and the accelerator (e.g., the GPU), since in many cases, the computing system does not share the same memory for both accelerators and the CPU.

To continue with our example, we build the Task-Graph as follows:

public class HelloTornado { 
    public static void init(FloatArray array) {...}

    public static void computeSqrt(FloatArray array) {...}

    public FloatArray compute(FloatArray array) {
        TaskGraph graph = new TaskGraph("graph")
          .transferToDevice(DataTransferMode.EVERY_EXECUTION, array)
          .task("init", HelloTornado::init, array)
          .task("compute", HelloTornado::computeSqrt, array)
          .transferToHost(DataTransferMode.EVERY_EXECUTION, array);
        return array;
    }
}

We see that, for creating and defining all data and tasks of our computation, we use mainly three methods from the Task-Graph API:

transferToDevice: it defines all objects to be copied to the target accelerator. It also defines a mode for each of the objects. In this case, we specify that the array object must be transferred every time the whole graph is executed. TornadoVM also supports read-only copies.
task: This is the method identification. To identify uniquely every method, we give a name. This name is useful to check with the profiler, change the device at runtime, etc. The next parameter of this method is the reference to an existing Java method. This could be a lambda expression, an instance method, or a static method. In our example, we use a reference to a static method.
We can define as many tasks as we want, and, as long as they are accessible from the class that instantiates the task graph, the referenced methods can be located in different Java classes.
transferToHost: it defines the data objects in which we expect the output of our computation. This could be one or many objects. Additionally, we pass a mode.
Usually, we want the output to be transferred to the host right after the execution of a task graph has finished. However, in some cases (e.g., iterative algorithms), we want to execute a graph multiple times and only transfer the data at the end. In this case, we could also define data to be transferred UNDER_DEMAND and use another data structure (called TornadoExecutionPlan) to copy data under demand.

Note that the Task-Graph is never executed. It only defines which method/s, and which object/s to use. To execute a Task-Graph, we need to instantiate an object of type TornadoExecutionPlan.
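The split between describing work and running it can be illustrated without TornadoVM. The toy class below uses hypothetical names and plain float arrays, and is not the TornadoVM API. It records named tasks in order and runs them only when asked to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy model of "describe first, execute later"; not the TornadoVM API.
final class ToyTaskGraph {
    private final List<String> names = new ArrayList<>();
    private final List<Consumer<float[]>> tasks = new ArrayList<>();

    // Registers a named task; nothing runs at this point.
    ToyTaskGraph task(String name, Consumer<float[]> body) {
        names.add(name);
        tasks.add(body);
        return this;
    }

    // Returns the registered task names in registration order.
    List<String> taskNames() {
        return List.copyOf(names);
    }

    // Runs the recorded tasks in registration order on the given data.
    void runAll(float[] data) {
        for (Consumer<float[]> task : tasks) {
            task.accept(data);
        }
    }

    static void init(float[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i] = i * 2;
        }
    }

    static void computeSqrt(float[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i] = (float) Math.sqrt(a[i]);
        }
    }

    public static void main(String[] args) {
        float[] data = new float[4];
        ToyTaskGraph graph = new ToyTaskGraph()
            .task("init", ToyTaskGraph::init)
            .task("compute", ToyTaskGraph::computeSqrt);
        System.out.println(data[2]); // prints 0.0, building the graph ran nothing
        graph.runAll(data);
        System.out.println(data[2]); // prints 2.0, i.e. sqrt(2 * 2)
    }
}
```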

3. Deploying and Running Task-Graphs

We are almost done. To execute a task graph, we need to instantiate an execution plan. The execution plan receives, as an argument, a snapshot of an existing task graph.

Wait, a snapshot? What is this?

A snapshot is an object that contains an immutable task graph, which, in turn, is a task graph that cannot be changed (e.g., by adding new tasks or new data).

This is by design to avoid changing the task graph (meaning appending more tasks or adding more data) while we execute code on the GPU.
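The idea of freezing a graph before running it can be sketched in plain Java. The class below is a toy model with hypothetical names, not the TornadoVM implementation. Its snapshot() returns an unmodifiable copy, so later changes to the mutable graph do not leak into a snapshot that is already in use.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a mutable graph and its immutable snapshot; not the TornadoVM API.
final class SnapshotSketch {
    private final List<String> taskNames = new ArrayList<>();

    // Appends a task name to the mutable graph.
    SnapshotSketch task(String name) {
        taskNames.add(name);
        return this;
    }

    // Returns an unmodifiable copy that later calls to task(...) cannot affect.
    List<String> snapshot() {
        return List.copyOf(taskNames);
    }

    public static void main(String[] args) {
        SnapshotSketch graph = new SnapshotSketch().task("init").task("compute");
        List<String> frozen = graph.snapshot();
        graph.task("late");                   // changes the mutable graph only
        System.out.println(frozen);           // prints [init, compute]
        System.out.println(graph.snapshot()); // prints [init, compute, late]
    }
}
```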

Let’s go back to our example and create an execution plan from the graph object:

TornadoExecutionPlan plan = new TornadoExecutionPlan(graph.snapshot());

And now, we can call the execute method:

plan.execute();

Done! If we do not specify anything else, the execute method is a blocking call, and it will optimise, compile and run the whole task graph on the default device.

For reference, this is the entire code of our example:

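// Imports assume the package layout of the TornadoVM 1.x API
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;
import uk.ac.manchester.tornado.api.math.TornadoMath;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

public class HelloTornado {
    // Task 1: fill the array; iterations are independent, so the loop is annotated
    public static void init(FloatArray array) {
        for (@Parallel int i = 0; i < array.getSize(); i++) {
            array.set(i, i * 2);
        }
    }
    // Task 2: element-wise square root in fp32
    public static void computeSqrt(FloatArray array) {
        for (@Parallel int i = 0; i < array.getSize(); i++) {
            float value = array.get(i);
            array.set(i, TornadoMath.sqrt(value));
        }
    }
    public FloatArray compute(FloatArray array) {
        // Describe data movement and tasks; nothing runs yet
        TaskGraph graph = new TaskGraph("graph")
            .transferToDevice(DataTransferMode.EVERY_EXECUTION, array)
            .task("init", HelloTornado::init, array)
            .task("compute", HelloTornado::computeSqrt, array)
            .transferToHost(DataTransferMode.EVERY_EXECUTION, array);
        // Freeze the graph and run it on the default device (blocking call)
        TornadoExecutionPlan plan = new TornadoExecutionPlan(graph.snapshot());
        plan.execute();
        return array;
    }
}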

Interacting with the Dispatcher

We can also change the default decisions of the TornadoVM runtime, and perform some actions (e.g., enable the profiler, change the hardware accelerator, enable dynamic reconfiguration, etc).

The TornadoVM Execution Plan follows a builder pattern to specify all these actions.

For example, to change a device:

int driverIndex = 0;
int deviceIndex = 1;
TornadoDevice device = TornadoExecutionPlan.getDevice(driverIndex, deviceIndex);
plan.withDevice(device)
    .execute();

And we can execute again, without the need to build a new task-graph.

Update from TornadoVM (2025):

Since publication, the TornadoVM Execution Plan has been extended with further experimental features. One of these features is Dynamic Reconfiguration, which was available in TornadoVM till version 1.1.1 and has since been deprecated in newer versions.

If we want to enable dynamic reconfiguration (a feature of TornadoVM to discover the best device depending on a policy), we can enable it as follows:

plan.withDynamicReconfiguration(Policy.PERFORMANCE, DRMode.PARALLEL)
    .execute();

In this call, we specify that we want to select the best device in terms of performance, and that TornadoVM should evaluate all permutations in parallel.

This API call triggers compilation and execution on all hardware accelerators available in our system, and then chooses the best device according to the policy we specified.
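The selection step can be pictured as “try every candidate, keep the best one”. The following plain-Java sketch is a toy model with hypothetical names, not the TornadoVM implementation. It times the same task once per candidate, one after another, and then applies a performance policy that picks the smallest measured time. The real feature can also evaluate candidates in parallel, which this sketch does not attempt.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of "try every device, keep the best one"; not the TornadoVM API.
final class BestDeviceSketch {

    // Performance policy: returns the device name with the smallest measured time.
    // Returns null for an empty map; on ties, the first entry wins.
    static String fastest(Map<String, Long> elapsedNanosByDevice) {
        String best = null;
        long bestTime = Long.MAX_VALUE;
        for (Map.Entry<String, Long> entry : elapsedNanosByDevice.entrySet()) {
            if (entry.getValue() < bestTime) {
                bestTime = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    // Runs each candidate once, sequentially, and records how long the run took.
    static Map<String, Long> measure(Map<String, Runnable> candidates) {
        Map<String, Long> timings = new LinkedHashMap<>();
        for (Map.Entry<String, Runnable> candidate : candidates.entrySet()) {
            long start = System.nanoTime();
            candidate.getValue().run();
            timings.put(candidate.getKey(), System.nanoTime() - start);
        }
        return timings;
    }

    public static void main(String[] args) {
        // Hand-written timings keep the example deterministic.
        Map<String, Long> timings = new LinkedHashMap<>();
        timings.put("cpu", 900L);
        timings.put("gpu", 120L);
        timings.put("fpga", 450L);
        System.out.println(fastest(timings)); // prints gpu
    }
}
```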

Cool, isn’t it? If you want to know more about dynamic reconfiguration, this paper contains more details.

There are more methods in the TornadoExecutionPlan class. We covered just two of them. If you are interested, I invite you to read the documentation and the examples. Additionally, I recorded a video showing, step by step, some of these functions in action.

Summary

In this article, we have explained the basics of the TornadoVM programming model and the main API blocks.

With these tools, developers can start integrating these components into their applications and start accelerating portions of the Java programs on hardware accelerators, such as GPUs.

If you want to know more, I invite you to explore the example suite in TornadoVM to get an idea of the types of applications that can be expressed using the TornadoVM API with more complex use cases.

Originally published at Foojay.io in May 2024.

And now, let’s review some of the other cool things that appeared on Foojay.io last month… but as promised, I will start with something more connected.

Foojay Podcast #82 – Leyden, Babylon, Panama, TornadoVM

As I mentioned earlier, the reason for republishing that slightly older piece is that Frank Delporte recently ran a whole series of interviews on a very juicy topic: the latest developments around the JDK.

This episode is a neat “hallway track tour” across Devoxx and J-Fall: Moritz Halbritter explains how Project Leyden improves Java startup (and how you can already feel it today in Spring Boot), John Ceccarelli from Azul adds the story of moving from x86 to ARM/Graviton plus general JVM performance tricks, Balkrishna Rawool shows why the Vector API from Panama landed perfectly in the AI space - even though it launched before the LLM boom - and the TornadoVM crew (Christos Kotselidis and Michalis Papadimitriou) explain how to run large models on GPUs without leaving pure Java, and how this all ties into Project Babylon.

Highly recommended podcast - and that’s only the beginning of the goodies.

Will OpenJFX Be Merged Into OpenJDK? It Would Be a Perfect Match with Java on Mobile!

Another interesting article from Frank – with a gloriously long title – Will OpenJFX Be Merged Into OpenJDK? It Would Be a Perfect Match with Java on Mobile! recaps why JavaFX left the JDK in the first place (bloat, independent release cycle, distribution issues), and then shows that in 2025, after several Java evolutions, those reasons are much weaker – while the potential upside of bringing it back into OpenJDK is much bigger.

The key point: in parallel, Johan Vos and Gluon are pushing hard on the “Java on Mobile” initiative – instead of maintaining a patched toolchain, they’re building OpenJDK natively for mobile, with a pipeline that already runs Hello World on iOS, and a roadmap that includes the iOS Simulator, Android, Leyden-based optimizations, and full-blown JavaFX apps as native mobile apps. From the authors’ perspective, if JavaFX returns to OpenJDK and OpenJDK becomes a first-class citizen on mobile, we finally get an honest “write once, run everywhere”: the same JavaFX code on desktop, mobile, and embedded, without sidecar toolchains – which is a pretty compelling vision, even if in the age of Kotlin Multiplatform it might feel a bit… optimistic.

Visitor Pattern “done right” – use the language, not the pattern

The Visitor Pattern – ‘Revisited’ using Data Oriented Programming technique by Wim De Troyer reminds us that many classic patterns (like Strategy) existed mostly because the language didn’t have better tools - and shows how Java 8 lambdas pretty much dissolved Strategy into normal code.

The author takes the Visitor pattern for a spin using a book curation system: the classic implementation with a type hierarchy, visitors, accept methods, and several levels of intermediate classes. To defuse that complexity, the whole thing is rewritten using Java 21 idioms: records, sealed interfaces as sum types, and beefed-up pattern matching in switch.

The result? The Visitor “disappears” - what’s left is a clean data model plus one well-designed switch, which the compiler checks for exhaustiveness. On the meta level, it’s a great example to show your team when it’s no longer worth torturing GoF patterns and instead remap them to the modern language.

There is nothing wrong per se with Design Patterns, but sometimes they give Java a bad rap.

GraphRAG vs RAG – when vectors are not enough

GraphRAG vs RAG – when vectors are not enough by Thibaut Gourdel does a nice job of cleaning up the vocabulary: classic RAG is a vector knowledge base – we chunk documents, build embeddings, and then look for semantically similar pieces to the prompt. This works great when the task is “find me the right fragment of docs,” but starts to fall apart when the question needs multi-hop reasoning, traversing relationships between entities, or understanding the structure of a large document rather than isolated paragraphs.

GraphRAG adds a knowledge graph to the mix - explicitly modeled entities and relations - and uses the LLM to walk the graph, not just the vectors. That gives better accuracy (benchmarks like Lettra show up to +35% for complex queries) and more explainability (you can see the path in the graph), but comes at a cost: you need to build and maintain the graph (with help from frontier LLMs), handle traversal complexity, latency, and operational overhead.

That’s the reason why your AI demo does not work in production as well as expected.

So the author pushes hybrid setups: vectors for “fast recall,” graph for reasoning, ideally in a single platform like MongoDB Atlas that is document, vector, and graph at the same time. From the perspective of agent architecture and compliance, it’s a great piece to explain when “we dumped PDFs into vectors” stops being enough – and when you need to move toward explicit domain modeling.

Foojay Podcast #83 – OpenJDK evolutions and tricks from the trenches

We started with a podcast, and unusually, we’ll end with one too - Frank really went all-in this time, and the two episodes form a nicely coherent combo.

The second podcast again uses the “many mini-stories” format: Johan Vos talks about the history of Java on Linux and how you can both preserve language/runtime stability and keep pushing the “write once, run everywhere” vision forward (yes, JavaFX on mobile and the openjdk-mobile initiative pop up again). Stephen Chin adds the perspective of modern JavaFX clients and education (including his daughter’s book on teaching kids to code), while Joseph Phillips revisits the REST vs gRPC debate and asks bluntly: with virtual threads around, do async APIs really make sense everywhere?

Then François Martin covers microbenchmarks in JMH, Wouter De Geus shares his journey from finance into dev and open-sourcing code (from inside the Dutch tax office!), and Roald Nefs shows how to actually use the Foreign Function & Memory API to “hack” cars - and what kind of security implications that brings.

All in all, it’s a ready-made bundle of anecdotes and “Java in the real world” examples - straight from the conference hallway track.

And that’s all, folks
