JVM Internals & Performance

Benchmarking & Measuring Performance

Measuring performance in Java is deceptively hard. The JVM is a highly adaptive runtime: it interprets bytecode, profiles hot paths, compiles them to native code on the fly, garbage-collects, and inlines methods into their callers, all while your benchmark is running. Ignore these dynamics and your numbers are fiction. This lesson teaches you why naive timing lies, what warm-up is and why it matters, and how the Java Microbenchmark Harness (JMH) addresses these problems.

Why Naive Timing Lies

The first instinct of most developers is to wrap code in System.nanoTime() calls and compute the difference. That approach is broken for microbenchmarks in several distinct ways.

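JIT compilation is not instantaneous. When the JVM first encounters a method it interprets it, which is slow. With HotSpot's default tiered compilation, C1 compiles a method after on the order of a couple of hundred invocations (loop back-edges also count towards the thresholds), and C2 recompiles it with aggressive optimisations only after several thousand more. The often-quoted figure of 10,000 invocations is the classic C2 threshold that applies when tiered compilation is disabled. Compilation also happens on background threads, so the code keeps running in its old form while the compiler works. If your benchmark runs a method 100 times, most or all of those executions are interpreted, and any that are not are only lightly optimised: the average tells you nothing about steady-state cost.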

Dead-code elimination. If the JIT can prove that a computation's result is never used, it eliminates the computation entirely. A benchmark that computes a sum but throws the result away may measure nothing at all.

Constant folding. A loop body that depends only on compile-time constants may be evaluated once and the loop removed. You measure a no-op.

GC interference. A garbage collection pause mid-measurement inflates your timing. Without controlling GC, successive runs differ by the GC's mood.

OS scheduling jitter. Thread preemption, CPU frequency scaling (turbo boost, power saving modes), and NUMA memory effects all add noise.

Never publish microbenchmark results from a simple timing loop. The JIT's warm-up curve means early iterations are unrepresentative. A single number taken after 100 iterations may be 5–50× slower than the steady-state cost your production code actually pays.
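To make the dead-code hazard concrete, here is a minimal sketch of a hand-rolled timing loop that at least keeps its result alive. The class name `SinkTiming`, the `SINK` field, and the iteration counts are illustrative assumptions. It is still a naive timer, exposed to every other problem listed above, so treat it as a demonstration of one precaution rather than a usable benchmark.

```java
public class SinkTiming {

    // Hypothetical sink: a volatile write is an observable side effect,
    // so the JIT cannot discard the value that flows into it.
    static volatile long SINK;

    // The work being timed: sum of 1..n.
    static long sumTo(int n) {
        long total = 0;
        for (int i = 1; i <= n; i++) total += i;
        return total;
    }

    // Times `reps` calls of sumTo and returns the average nanoseconds per call.
    // Every result is accumulated and finally written to SINK.
    static double timeConsumed(int reps, int n) {
        long acc = 0;
        long start = System.nanoTime();
        for (int r = 0; r < reps; r++) {
            acc += sumTo(n + (r & 1)); // alternate the argument so it is not one fixed constant
        }
        long elapsed = System.nanoTime() - start;
        SINK = acc; // consume the accumulated result
        return (double) elapsed / reps;
    }

    public static void main(String[] args) {
        double avg = timeConsumed(100_000, 1_000);
        System.out.printf("avg: %.1f ns/call (sink=%d)%n", avg, SINK);
    }
}
```

Writing to a volatile field is only one way to keep a result alive; it does nothing about warm-up, GC pauses, or scheduling jitter.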

Understanding Warm-Up

Warm-up is the period during which the JVM transitions a piece of code from interpreted execution to fully optimised native code. The JVM's tiered compilation pipeline has multiple levels:

Level 0: Pure interpretation.
Level 1–3: C1 compiler (client compiler) — fast compilation with basic optimisations.
Level 4: C2 compiler (server compiler) — aggressive speculative optimisations, inlining, escape analysis.

A benchmark should measure only Level 4 steady-state throughput. That means running the code for enough iterations, typically thousands of calls, before you start recording measurements. The exact amount of warm-up needed varies with method complexity and the JVM's profiling decisions, so confirm it empirically rather than assuming a fixed number.
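One way to see warm-up for yourself is to time the same work in consecutive batches and print the per-batch average. The sketch below does that with the standard library only; the class name `WarmupCurve`, the batch sizes, and the `SINK` field are assumptions made for illustration, and the numbers it prints will differ between machines and JVM versions.

```java
import java.util.Arrays;

public class WarmupCurve {

    // Hypothetical sink that keeps each batch's result alive.
    static volatile long SINK;

    // The work being observed: sum an int array.
    static long work(int[] data) {
        long total = 0;
        for (int v : data) total += v;
        return total;
    }

    // Runs `batches` batches of `callsPerBatch` calls each and returns
    // the average nanoseconds per call for every batch, in execution order.
    static double[] perBatchAverages(int batches, int callsPerBatch, int[] data) {
        double[] avgNanos = new double[batches];
        for (int b = 0; b < batches; b++) {
            long acc = 0;
            long start = System.nanoTime();
            for (int c = 0; c < callsPerBatch; c++) acc += work(data);
            long elapsed = System.nanoTime() - start;
            SINK = acc; // consume the batch result
            avgNanos[b] = (double) elapsed / callsPerBatch;
        }
        return avgNanos;
    }

    public static void main(String[] args) {
        int[] data = new int[1_000];
        Arrays.fill(data, 1);
        double[] curve = perBatchAverages(20, 2_000, data);
        for (int b = 0; b < curve.length; b++) {
            System.out.printf("batch %2d: %.1f ns/call%n", b, curve[b]);
        }
    }
}
```

Early batches usually report higher averages than later ones, and the point where the figures level off is a rough indication of when warm-up has finished. Running the program with the HotSpot flag `-XX:+PrintCompilation` prints compilation events, which lets you line up drops in the curve with the moments the method is compiled.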

Consider this deceptive example:

import java.util.List;

public class NaiveBenchmark {

    static int sumList(List<Integer> list) {
        int total = 0;
        for (int v : list) total += v;
        return total;
    }

    public static void main(String[] args) {
        var data = List.of(1, 2, 3, 4, 5);

        // First measurement: cold start, almost entirely interpreted
        long t0 = System.nanoTime();
        for (int i = 0; i < 100; i++) sumList(data);
        long t1 = System.nanoTime();
        System.out.printf("First 100 iters avg: %.0f ns%n", (t1 - t0) / 100.0);

        // Second measurement: warmer, but after only ~100 calls sumList
        // is not guaranteed to have been JIT-compiled yet
        long t2 = System.nanoTime();
        for (int i = 0; i < 100; i++) sumList(data);
        long t3 = System.nanoTime();
        System.out.printf("Second 100 iters avg: %.0f ns%n", (t3 - t2) / 100.0);
    }
}

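In practice the second block is often 5–20× faster than the first, even on trivial code. Note also that the result of sumList is discarded, so once the JIT does compile the loop it is free to eliminate the work altogether. Neither number is wrong; they measure different JVM states. Long-running production code spends almost all of its time in the steady state, so your benchmark should measure that state too.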

The Java Microbenchmark Harness (JMH)

JMH, developed by the JVM performance engineers at Oracle and distributed via OpenJDK, is the standard tool for writing correct Java microbenchmarks. It handles warm-up, dead-code elimination prevention (via Blackhole and result consumption), forked JVM processes, and statistical aggregation automatically.

Adding JMH to a Maven Project

<!-- pom.xml dependency -->
<dependencies>
    <dependency>
        <groupId>org.openjdk.jmh</groupId>
        <artifactId>jmh-core</artifactId>
        <version>1.37</version>
    </dependency>
    <dependency>
        <groupId>org.openjdk.jmh</groupId>
        <artifactId>jmh-generator-annprocess</artifactId>
        <version>1.37</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

A Minimal JMH Benchmark

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import java.util.concurrent.TimeUnit;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

@BenchmarkMode(Mode.AverageTime)          // measure average time per operation
@OutputTimeUnit(TimeUnit.MICROSECONDS)    // report in microseconds
@State(Scope.Benchmark)                   // one instance shared by all threads
@Warmup(iterations = 5, time = 1)        // 5 warm-up iterations, 1 second each
@Measurement(iterations = 10, time = 1)  // 10 measurement iterations
@Fork(2)                                  // run in 2 fresh JVM processes
public class ListSumBenchmark {

    private List<Integer> data;

    @Setup
    public void setup() {
        data = new ArrayList<>(IntStream.rangeClosed(1, 1_000)
                                         .boxed()
                                         .toList());
    }

    @Benchmark
    public void imperativeSum(Blackhole bh) {
        int total = 0;
        for (int v : data) total += v;
        bh.consume(total);   // prevents dead-code elimination
    }

    @Benchmark
    public void streamSum(Blackhole bh) {
        int total = data.stream().mapToInt(Integer::intValue).sum();
        bh.consume(total);
    }
}

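Consuming the result is essential. If nothing uses the computed value, the JIT is free to conclude that the computation is dead and eliminate it entirely. bh.consume(value) creates a dependency the JIT cannot see through, which defeats this optimisation without adding meaningful overhead itself. Returning the value from the @Benchmark method works as well, because JMH consumes return values implicitly; an explicit Blackhole is most useful when a method produces several results.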

Key JMH Annotations Explained

@BenchmarkMode — Mode.AverageTime, Mode.Throughput, Mode.SampleTime, or Mode.SingleShotTime. Choose based on what matters: average latency, throughput, or percentile distribution.
@Fork — runs each benchmark in a fresh JVM. This isolates JIT state between benchmarks and prevents one benchmark's profiling decisions from affecting another. @Fork(0) runs the benchmark inside the harness JVM itself; it is useful for debugging, but never use it for measurements you intend to rely on.
@Warmup / @Measurement — control the warm-up and measurement phases separately. Warm-up iterations are discarded; only measurement iterations contribute to the reported result.
@State — Scope.Benchmark (shared), Scope.Thread (per-thread copy), Scope.Group (per benchmark group). Determines object sharing in multi-threaded benchmarks.
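@Setup / @TearDown — mark methods on a @State class that initialise and clean up state. They run outside the measured code, so put expensive preparation there rather than inside the @Benchmark method.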

Running JMH and Reading the Output

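Build a fat JAR and run it from the command line. The commands below assume the build also configures the maven-shade-plugin to produce target/benchmarks.jar, as the JMH Maven archetype does; the dependencies shown earlier are not enough on their own.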

mvn clean package -DskipTests
java -jar target/benchmarks.jar ListSumBenchmark -rf json -rff results.json

JMH prints a table like this:

Benchmark                       Mode  Cnt  Score   Error  Units
ListSumBenchmark.imperativeSum  avgt   20  2.341 ± 0.041  us/op
ListSumBenchmark.streamSum      avgt   20  3.912 ± 0.088  us/op

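Cnt is the number of measurement samples: 2 forks × 10 measurement iterations gives 20. The ± value is the half-width of a 99.9% confidence interval computed across those forks and iterations. A narrow interval means the measurement is stable. A wide interval means the variance is high, and more iterations, more forks, or a more isolated machine are needed. When two benchmarks' intervals overlap, the data does not support claiming that one is faster.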
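The arithmetic behind an error figure of this kind can be sketched in a few lines. The example below computes the textbook half-width of a confidence interval, t × s / √n, from a set of scores. The class name `ScoreError`, the sample scores, and the hard-coded critical value are assumptions for illustration; the sketch shows the statistics and does not claim to reproduce JMH's own implementation.

```java
public class ScoreError {

    // Arithmetic mean of the samples.
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Sample standard deviation (divides by n - 1).
    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double sq = 0;
        for (double x : xs) sq += (x - m) * (x - m);
        return Math.sqrt(sq / (xs.length - 1));
    }

    // Half-width of a confidence interval: tCritical * s / sqrt(n).
    // tCritical is the Student's t critical value for n - 1 degrees of freedom.
    static double halfWidth(double[] xs, double tCritical) {
        return tCritical * sampleStdDev(xs) / Math.sqrt(xs.length);
    }

    public static void main(String[] args) {
        // Made-up scores for illustration, not real measurements.
        double[] scores = {2.31, 2.35, 2.33, 2.36, 2.34};
        // Approximate two-sided 99.9% critical value for 4 degrees of freedom,
        // taken from a standard t-table.
        double t = 8.610;
        System.out.printf("%.3f ± %.3f us/op%n", mean(scores), halfWidth(scores, t));
    }
}
```

Because the half-width shrinks with √n, halving the error takes roughly four times as many samples, which is why adding forks and iterations narrows the interval only slowly.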

Run benchmarks on an otherwise idle machine. Browser tabs, email clients, and background processes all steal CPU time and inject noise. For critical measurements, pin the benchmark process to a dedicated CPU core with taskset (Linux) and disable frequency scaling.

Common Benchmarking Traps to Avoid

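Benchmark loop fusion: when a single operation takes only a few nanoseconds, it is tempting to loop over it inside the @Benchmark method. The JIT can unroll, merge, and hoist work across those iterations, so the per-operation cost looks lower than it really is. Prefer letting JMH invoke the method repeatedly; if you must loop, annotate the method with @OperationsPerInvocation so the score is reported per operation, and treat the result with suspicion.
Too few warm-up iterations: complex methods with deep call graphs need more warm-up. Start with at least 5 × 1-second iterations, check that the scores of the last warm-up iterations have levelled off, and verify with -prof gc that GC is not interfering.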
Benchmarking the wrong thing: measuring HashMap.get() with String keys constructed inside the benchmark body measures String allocation and hashing, not retrieval alone.
Ignoring allocation rate: use -prof gc to see bytes allocated per operation. A method that allocates heavily will trigger GC pauses in production even if it looks fast in isolation.
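To get a feel for what a bytes-per-operation figure means, the following sketch estimates it by hand with the HotSpot-specific `com.sun.management.ThreadMXBean`. The class name `AllocPerOp`, the `KEEP` array, and the repetition count are assumptions for illustration. This is a rough approximation that relies on the JVM supporting thread allocation accounting; it is not a description of how the JMH profiler works.

```java
import java.lang.management.ManagementFactory;

public class AllocPerOp {

    // Retains the most recent allocation so it cannot be optimised away.
    static final byte[][] KEEP = new byte[1][];

    // Returns the approximate number of bytes allocated per call of `op`
    // on the current thread, averaged over `reps` calls.
    static long bytesPerOp(Runnable op, int reps) {
        // HotSpot-specific extension of the standard ThreadMXBean.
        com.sun.management.ThreadMXBean threads =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long id = Thread.currentThread().getId();
        long before = threads.getThreadAllocatedBytes(id);
        for (int i = 0; i < reps; i++) op.run();
        long after = threads.getThreadAllocatedBytes(id);
        return (after - before) / reps;
    }

    public static void main(String[] args) {
        // Each call allocates a 1 KiB array plus its object header.
        long perOp = bytesPerOp(() -> KEEP[0] = new byte[1024], 10_000);
        System.out.println("approx bytes/op: " + perOp);
    }
}
```

A figure that is much larger than you expect for an operation is a hint that hidden allocation, such as boxing or temporary strings, dominates its cost.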

Summary

Naive timing with System.nanoTime() produces unreliable results because JIT compilation and GC happen while you measure, and because optimisations such as dead-code elimination and constant folding can remove the very work you meant to time. Warm-up is the JVM's transition from interpreted to fully optimised code; measurements taken before warm-up completes reflect interpreter performance, not production performance. JMH addresses these problems through controlled warm-up phases, forked JVM processes, Blackhole result consumption, and statistical aggregation, although a quiet machine remains your responsibility. Use JMH for any benchmark you intend to act on.