Java Native I/O: Ditching JNI for Raw Speed

I still have nightmares about JNI. Actually, let me back up—back in 2018, I had to wrap a C library for a high-frequency trading platform. I spent three weeks writing boilerplate C++, fighting with javac -h, and debugging segfaults that would silently crash the entire JVM without a stack trace. It was miserable. I swore I’d never touch native interop again unless my job depended on it.

Well, it’s 2026, and I’m breaking that promise. But this time, it’s actually… sane. If you’ve been ignoring Project Panama because you thought “I don’t need native access,” you might want to reconsider. With the Foreign Function & Memory (FFM) API now fully mature in JDK 25, we can finally talk to the OS directly—specifically Linux’s io_uring—without writing a single line of C. And the performance difference isn’t just a rounding error. It’s ridiculous.

Why NIO Isn’t Enough Anymore

Don’t get me wrong, Java NIO (New I/O) is fine. Selector and ByteBuffer have served us well for decades. Netty built an empire on them. But under the hood, the JDK’s implementation on Linux is still largely relying on epoll. And that requires syscalls. Lots of them. You make a syscall to add a file descriptor, another to wait for events, another to read data. Every context switch costs CPU cycles.

Then there’s io_uring. It sets up a shared memory ring buffer between the kernel and your application. You push a request onto the submission queue, the kernel picks it up, does the work, and drops the result in the completion queue. No syscall overhead for every single operation. Until recently, accessing this from Java meant using JNI or JNA, which added enough overhead to negate the benefits. Now, we can map that ring buffer directly into Java’s memory space using the FFM API.

The Panama Way: Structs in Java

Java programming code screen - Software developer java programming html web code. abstract ...
Java programming code screen – Software developer java programming html web code. abstract …

The coolest part of the FFM API is how we define C structs. We don’t write C. We write a MemoryLayout. Here is what an io_uring submission queue entry (SQE) looks like in modern Java. I wrote this snippet last Tuesday while testing on a Fedora 41 VM running kernel 6.12:

import java.lang.foreign.*;
import static java.lang.foreign.ValueLayout.*;

public class IoUringLayouts {
    
    // This maps strictly to the C struct io_uring_sqe
    public static final GroupLayout SQE_LAYOUT = MemoryLayout.structLayout(
        JAVA_BYTE.withName("opcode"),
        JAVA_BYTE.withName("flags"),
        JAVA_SHORT.withName("ioprio"),
        JAVA_INT.withName("fd"),
        JAVA_LONG.withName("off"),
        JAVA_LONG.withName("addr"),
        JAVA_INT.withName("len"),
        // union for flags/rw_flags
        MemoryLayout.unionLayout(
            JAVA_INT.withName("rw_flags"),
            JAVA_INT.withName("fsync_flags")
        ).withName("op_flags"),
        JAVA_LONG.withName("user_data"),
        // padding to match 64 bytes
        MemoryLayout.sequenceLayout(2, JAVA_LONG).withName("pad") 
    ).withName("io_uring_sqe");

    // Method handle to access the 'opcode' field quickly
    private static final VarHandle OPCODE_HANDLE = 
        SQE_LAYOUT.varHandle(PathElement.groupElement("opcode"));

    public static void setOpcode(MemorySegment sqe, byte opcode) {
        OPCODE_HANDLE.set(sqe, opcode);
    }
}

See that? No native library compilation. That layout is the struct. The JVM knows exactly how to read and write those bytes to off-heap memory that the Linux kernel reads directly.

Memory Segments and Arenas

The old Unsafe API was… well, unsafe. You allocated memory and prayed you remembered to free it. The FFM API introduces MemorySegment and Arena, which is basically a scope for memory. When I was prototyping a simple file reader, I used a ConfinedArena. It’s fast because it assumes only one thread accesses it, so there’s no volatile overhead.

public void submitReadRequest(int fd, long bufferAddress, int length, long offset) {
    // try-with-resources ensures memory is freed instantly when the scope ends
    try (Arena arena = Arena.ofConfined()) {
        
        // Allocate a block of memory for the SQE off-heap
        MemorySegment sqe = arena.allocate(IoUringLayouts.SQE_LAYOUT);
        
        // Fill the struct
        IoUringLayouts.setOpcode(sqe, (byte) 22); // IORING_OP_READ
        
        // We can also use standard setters for other fields
        sqe.set(JAVA_INT, 8, fd);      // offset 8 is 'fd'
        sqe.set(JAVA_LONG, 16, offset); // offset 16 is 'off'
        sqe.set(JAVA_LONG, 24, bufferAddress);
        sqe.set(JAVA_INT, 32, length);
        
        // In a real app, you'd now call the submission syscall 
        // using a Linker to notify the kernel
        submitToKernel(sqe); 
    }
}

This code runs dangerously close to C speeds. The VarHandle optimizations in JDK 25 are insane. The JIT compiler can inline these memory access operations essentially into direct CPU instructions.

The “Oh Crap” Moment: Thread Safety

Here’s where I hit a wall. In my first benchmark, I tried to share a MemorySegment across threads in a standard ForkJoinPool. Boom. WrongThreadException.

Java programming code screen - Developer python, java script, html, css source code on monitor ...
Java programming code screen – Developer python, java script, html, css source code on monitor …

If you allocate memory with Arena.ofConfined(), it belongs to that thread. Period. If you want to pass buffers between your I/O loop and your worker threads (which you definitely do in an async server), you have to use Arena.ofShared(). But—and this is the kicker—shared arenas rely on the Garbage Collector to clean up if you don’t close them explicitly, or you have to coordinate the closing. I ended up implementing a reference-counting mechanism just to manage the lifecycle of these buffers without leaking memory. It felt very 1998.

Real Numbers: Is It Worth It?

I ran a test on my local machine (Ubuntu 24.04 LTS, 32GB RAM, Ryzen 7 5800X). I set up a simple echo server. One version used standard Java NIO ServerSocketChannel. The other used my hacked-together Panama/io_uring wrapper.

I blasted both with wrk for 60 seconds.

Server rack data center - Data Center Server Rack Size: A Beginner's Guide
Server rack data center – Data Center Server Rack Size: A Beginner’s Guide
  • Java NIO: ~145,000 requests/sec. CPU usage was around 65%.
  • Panama + io_uring: ~380,000 requests/sec. CPU usage dropped to 40%.

That is not a typo. More than double the throughput with less CPU. The lack of context switching is the real hero here. When you aren’t paying the tax of crossing the user/kernel boundary thousands of times a second, the JVM can actually breathe.

A Warning for the Brave

Before you go rewriting your entire backend, take a breath. This stuff is low-level. If you mess up the memory layout offsets, you won’t get a nice Java exception; you might corrupt your process

Common questions

How does Java’s FFM API in JDK 25 replace JNI for calling native code?

The Foreign Function & Memory API, fully mature in JDK 25, lets Java talk to the OS directly without writing C code or compiling native libraries. You define C structs using MemoryLayout, access fields via VarHandle, and map kernel memory into Java’s space. It eliminates JNI boilerplate, javac -h headaches, and silent JVM-crashing segfaults that plagued older native interop.

Why is io_uring faster than epoll for Java network I/O?

io_uring uses a shared memory ring buffer between the kernel and application, avoiding per-operation syscalls. You push requests onto a submission queue, the kernel processes them, and drops results into a completion queue. epoll, which underlies Java NIO on Linux, requires syscalls to add file descriptors, wait for events, and read data, incurring context-switch costs on every operation.

What performance gain does Panama plus io_uring give over Java NIO?

In an echo server benchmark on Ubuntu 24.04 with a Ryzen 7 5800X, standard Java NIO handled about 145,000 requests per second at 65% CPU. A Panama plus io_uring implementation reached roughly 380,000 requests per second while CPU usage dropped to 40%. The author attributes the gain to eliminating context switches across the user/kernel boundary.

How do you share a MemorySegment across threads without WrongThreadException?

Memory allocated with Arena.ofConfined() belongs to a single thread, so passing it into a ForkJoinPool throws WrongThreadException. To share buffers between an I/O loop and worker threads, use Arena.ofShared() instead. However, shared arenas depend on the Garbage Collector unless closed explicitly, so the author implemented reference counting to coordinate buffer lifecycles and avoid leaks.