Benchmarks
Bloviate ships with two complementary performance suites: CPU micro-benchmarks that
measure raw value generation in isolation (via JMH), and
end-to-end fills that time DatabaseFiller.fill() against real databases running in
containers. The CPU suite isolates the cost of generating values and dispatching them
per cell; the end-to-end suite captures everything a real run pays for — metadata fetch,
generation, batching, and JDBC round-trips. Every number on this page is reproducible:
the benchmarks live in the (never-published) bloviate-benchmarks module, use fixed seeds
so each iteration generates identical data, and record the exact hardware, JDK, and database
image they were measured on.
Two benchmark suites
The work being optimized spans two regimes, so there are two suites:
| Suite | What it measures | Where the wins show up | Tool |
|---|---|---|---|
| CPU micro-benchmarks | raw value generation and per-cell generator dispatch, no database | hot-loop micro-opts, generator-level changes | JMH |
| End-to-end fill | DatabaseFiller.fill() throughput (rows/sec) against a real DB | parallel table fill, commit tuning, batch rewrite | plain JUnit runner |
The bloviate-benchmarks module isn’t published to Maven Central and nothing in
bloviate-core depends on it — running the benchmarks is entirely opt-in and adds no weight
to the library.
CPU micro-benchmarks (JMH)
Pure-CPU, no Docker. The benchmarks resolve generators exactly the way TableFiller does
(DatabaseSupport.getDataGenerator(column, random)), so the numbers reflect real engine cost.
- `GeneratorBenchmark` measures throughput of `generate()` / `generateAsString()` across a spread of column types (int, numeric, varchar, timestamp, uuid, jsonb, …).
- `RowDispatchBenchmark` models the inner loop of `TableFiller.fill()`: the per-cell `generatorMap.get(column)` HashMap lookup plus `generate()` over a wide row. This is the baseline for the “index generators by array position” change.

Build the runnable uber-jar and run it:

```bash
./mvnw -q -DskipTests -pl bloviate-benchmarks -am package
java -jar bloviate-benchmarks/target/benchmarks.jar
```

Useful invocations (standard JMH CLI; flags take a space):

```bash
# one benchmark, quick
java -jar bloviate-benchmarks/target/benchmarks.jar RowDispatchBenchmark

# restrict GeneratorBenchmark to specific column types
java -jar bloviate-benchmarks/target/benchmarks.jar GeneratorBenchmark.generate \
  -p genCase=UUID,VARCHAR_SHORT

# fewer forks/iterations while iterating locally
java -jar bloviate-benchmarks/target/benchmarks.jar -f 1 -wi 2 -i 3
```

End-to-end fill (real database)
Opt-in JUnit runners tagged @Tag("benchmark"), one per database, reusing the same
TestContainers stack as the core tests. They are skipped by a normal build and only run
under the bench profile. Requires Docker (OrbStack works).
Each measured iteration truncates the schema, times a full DatabaseFiller.fill() (metadata
fetch included), and prints rows/sec. A fixed seed makes every iteration generate identical
data, so timings are comparable across runs and across optimization branches.
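
The measurement loop described above (untimed warmup fills, then timed fills reported as best and mean rows/sec) can be sketched roughly as follows. This is a minimal illustration, not the harness in bloviate-benchmarks: the `FillTimer` class and its methods are hypothetical, the stand-in workload replaces the truncate-and-fill step, and averaging the per-iteration rates to get the mean is an assumption.

```java
import java.util.Locale;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a warmup + timed-iterations loop; not the real harness. */
final class FillTimer {

    /** Throughput in rows per second for a fill that took elapsedNanos. */
    static double rowsPerSecond(long rows, long elapsedNanos) {
        return rows / (elapsedNanos / 1_000_000_000.0);
    }

    /**
     * Runs `warmup` untimed fills, then `iterations` timed fills.
     * The supplier performs one fill and returns the number of rows written.
     * Returns {best, mean}; the mean averages per-iteration rates (an assumption).
     */
    static double[] measure(String label, int warmup, int iterations, LongSupplier fill) {
        for (int i = 0; i < warmup; i++) {
            fill.getAsLong(); // untimed: JIT, connection pool, cache priming
        }
        double best = 0;
        double sum = 0;
        for (int i = 1; i <= iterations; i++) {
            long start = System.nanoTime();
            long rows = fill.getAsLong();
            long elapsed = System.nanoTime() - start;
            double rate = rowsPerSecond(rows, elapsed);
            best = Math.max(best, rate);
            sum += rate;
            System.out.printf(Locale.ROOT, "[bench] %s iteration %d  %,d rows  %.3f s  %,.0f rows/s%n",
                    label, i, rows, elapsed / 1e9, rate);
        }
        return new double[] {best, sum / iterations};
    }

    public static void main(String[] args) {
        // Stand-in workload; a real run would truncate the schema and call DatabaseFiller.fill().
        double[] result = measure("demo/sketch", 1, 3, () -> {
            long acc = 0;
            for (int i = 0; i < 1_000_000; i++) {
                acc += i;
            }
            return acc > 0 ? 500_000L : 0L;
        });
        System.out.printf(Locale.ROOT, "best %,.0f rows/s, mean %,.0f rows/s%n", result[0], result[1]);
    }
}
```

Timing the whole fill call, rather than only the insert phase, matches the text's point that metadata fetch is included in the measured interval.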
- `PostgresFillBenchmark`: TPC-C schema, a deliberately wide FK-free schema (`create_wide.postgres.sql`, the between-table parallel-fill target), and a single dominant table (`create_single.postgres.sql`, the intra-table partitioning target).
- `MySqlFillBenchmark`, `CockroachFillBenchmark`: TPC-C schema.

```bash
# all end-to-end benchmarks (Postgres + MySQL + CockroachDB)
./mvnw -pl bloviate-benchmarks -am -Pbench test

# just Postgres, just the wide schema, larger dataset
./mvnw -pl bloviate-benchmarks -am -Pbench test \
  -Dtest='PostgresFillBenchmark#wide' -Dbench.rows=500000

# intra-table partitioning: one big table, sequential baseline vs 8-way partitioned
./mvnw -pl bloviate-benchmarks -am -Pbench test \
  -Dtest='PostgresFillBenchmark#singleTable' -Dbench.rows=2000000 -Dbench.threads=1
./mvnw -pl bloviate-benchmarks -am -Pbench test \
  -Dtest='PostgresFillBenchmark#singleTable' -Dbench.rows=2000000 -Dbench.threads=8
```

Output lines look like:

```
[bench] postgres/wide iteration 1 500,000 rows 3.612 s 138,430 rows/s
[bench] postgres/wide best 500,000 rows 3.580 s 139,664 rows/s
[bench] postgres/wide mean 500,000 rows 3.601 s 138,850 rows/s
```

Tuning (system properties)
| Property | Default | Meaning |
|---|---|---|
| `bench.warmup` | 1 | untimed warmup fills (JIT / pool / cache priming) |
| `bench.iterations` | 3 | timed fills; best and mean are reported |
| `bench.batch` | 1000 | JDBC batch size (`DatabaseConfiguration.batchSize`) |
| `bench.threads` | 1 | worker threads; 1 = sequential baseline, >1 = parallel DataSource path |
| `bench.partitions` | max(2, threads) | intra-table partitions for the single schema |
| `bench.rewriteBatched` | true | Postgres driver batch rewrite; set false for the naive baseline |
| `bench.commit` | none | commit strategy for single: none, perTable, or everyN:K |
| `bench.rows` | 50000 | default row count for the wide/single schema (per table) |
| `bench.warehouses` | 1 | TPC-C scale factor (W) |
| `bench.items` | 10000 | TPC-C items (I) |
| `bench.districts` | 10 | TPC-C districts per warehouse (D) |
| `bench.customers` | 300 | TPC-C customers/orders per district (C) |
| `bench.minLines` / `bench.maxLines` | 5 / 15 | TPC-C order lines per order |
The defaults produce a modest (~60k-row TPC-C / ~500k-row wide) dataset that runs quickly; scale them up for a real large-dataset measurement.
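
To make the defaults in the table concrete, here is a minimal sketch of reading a few of these system properties and of how the commit setting changes the number of commits for one table. The `BenchConfig` class is hypothetical and is not the harness's own code; the property names and defaults are taken from the table, and treating `none` as one implicit autocommit per batch and `K` in `everyN:K` as a batch count follows the description of the commit strategy elsewhere on this page.

```java
/** Hypothetical sketch of reading bench.* knobs; names and defaults mirror the table above. */
final class BenchConfig {
    final int threads;
    final int partitions;
    final int batch;
    final long rows;
    final String commit;

    BenchConfig() {
        threads = Integer.getInteger("bench.threads", 1);
        // The partition default depends on the thread count: max(2, threads).
        partitions = Integer.getInteger("bench.partitions", Math.max(2, threads));
        batch = Integer.getInteger("bench.batch", 1000);
        rows = Long.getLong("bench.rows", 50_000L);
        commit = System.getProperty("bench.commit", "none");
    }

    /**
     * Commits paid while filling one table of `rows` rows.
     * "none" is read as autocommit (one implicit commit per batch), "perTable" as a
     * single commit, and "everyN:K" as one commit every K batches.
     */
    static long commitsFor(String commit, long rows, int batch) {
        long batches = (rows + batch - 1) / batch; // ceiling division
        if (commit.equals("none")) {
            return batches;
        }
        if (commit.equals("perTable")) {
            return 1;
        }
        if (commit.startsWith("everyN:")) {
            long k = Long.parseLong(commit.substring("everyN:".length()));
            if (k < 1) {
                throw new IllegalArgumentException("K must be positive: " + commit);
            }
            return (batches + k - 1) / k;
        }
        throw new IllegalArgumentException("unknown commit strategy: " + commit);
    }

    public static void main(String[] args) {
        BenchConfig c = new BenchConfig();
        System.out.printf("threads=%d partitions=%d batch=%d rows=%d commit=%s%n",
                c.threads, c.partitions, c.batch, c.rows, c.commit);
        for (String strategy : new String[] {"none", "perTable", "everyN:50"}) {
            System.out.printf("%-10s -> %d commits for 1,000,000 rows%n",
                    strategy, commitsFor(strategy, 1_000_000L, c.batch));
        }
    }
}
```

Run with `-Dbench.threads=8` and no explicit `bench.partitions`, the sketch resolves eight partitions; with no properties set it resolves two.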
Recording a baseline
Run on a quiet machine and record the environment alongside the numbers: JMH scores and fill throughput are only comparable within the same hardware/JDK/DB image. Capture the machine/CPU, the JDK, and the Docker/DB image next to the figures, exactly as the recorded results below do.
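
One low-effort way to capture the JVM side of that environment is to print it from the same JVM that runs the benchmark. The sketch below is only an illustration (the `BenchEnvironment` class is hypothetical); it uses standard JDK system properties, and the Docker version and database image tag still have to be recorded by hand.

```java
/** Hypothetical helper that prints the JVM-visible part of the benchmark environment. */
final class BenchEnvironment {

    /** One line describing OS, architecture, core count, and JDK build. */
    static String describe() {
        return String.format("os=%s %s (%s), cores=%d, jdk=%s %s",
                System.getProperty("os.name"),
                System.getProperty("os.version"),
                System.getProperty("os.arch"),
                Runtime.getRuntime().availableProcessors(),
                System.getProperty("java.vendor"),
                Runtime.version());
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```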
Fill throughput: stacking the optimizations
DatabaseFiller.fill() applies five independent optimizations, each aimed at a different
bottleneck: the JDBC wire, transaction overhead, single-table parallelism, cross-table
parallelism, and the per-cell hot loop. Because they target different costs, they stack: on one
large table they compound to a 6.08× speedup, and on a wide multi-table schema parallel fill
alone delivers 3.26×. The two views below quantify that: a cumulative progression on a single
dominant table, and a cross-schema comparison of parallel table fill. The levers:
- driver batch rewrite: `reWriteBatchedInserts` collapses each JDBC batch into one multi-row `INSERT`, cutting round-trips to the database.
- commit strategy: disable autocommit and commit once per table (or every N batches) instead of paying transaction overhead per batch; see Commit strategy.
- intra-table partitioning: split one large table’s rows across workers. This is the only lever for a single dominant table (one table, one topological level), which parallel table fill cannot speed up. See Intra-table partitioning.
- parallel table fill: `DataSource` + `threads`, one connection per worker; the lever for schemas with many independent tables. See Parallel fill — topological levels.
- hot-loop micro-opt: positional generator dispatch in `TableFiller`, which removes a HashMap lookup from the innermost loop (a clear CPU win in isolation; within end-to-end noise, see the CPU table).

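
To illustrate the intra-table partitioning lever from the list above, the sketch below splits one table's row count into contiguous, near-equal ranges, one per worker. How Bloviate actually assigns rows to partitions is not described here, so contiguous ranges are an assumption, and `RowPartitioner` and `Range` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split a table's rows into contiguous ranges, one per worker. */
final class RowPartitioner {

    /** Half-open row range [startInclusive, endExclusive). */
    record Range(long startInclusive, long endExclusive) {
        long size() {
            return endExclusive - startInclusive;
        }
    }

    /** Splits `rows` into `partitions` ranges whose sizes differ by at most one row. */
    static List<Range> split(long rows, int partitions) {
        if (rows < 0 || partitions < 1) {
            throw new IllegalArgumentException("rows must be >= 0 and partitions >= 1");
        }
        List<Range> ranges = new ArrayList<>(partitions);
        long base = rows / partitions;
        long remainder = rows % partitions; // the first `remainder` ranges get one extra row
        long start = 0;
        for (int p = 0; p < partitions; p++) {
            long size = base + (p < remainder ? 1 : 0);
            ranges.add(new Range(start, start + size));
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 1,000,000 rows over 8 workers gives eight ranges of 125,000 rows each.
        split(1_000_000L, 8).forEach(System.out::println);
    }
}
```

Each range could then be handed to a worker with its own connection, which is why this lever helps a single dominant table where table-level parallelism cannot.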
Environment
- Machine / CPU: Apple M5 (Mac17,3), 10 cores
- JDK: Temurin 25.0.3+9 LTS
- Docker / DB image: OrbStack (Docker 29.4.0); `postgres:18-alpine`, `mysql:9` (Testcontainers default), `cockroachdb`
- Harness: `bench.batch=1000`; `bench.rows=1000000` for `wide`/`single`, TPC-C at default cardinalities (~62k rows). Cross-schema figures are best of 3 (`bench.warmup=1`, `bench.iterations=3`); the single-table progression is best of 5 (`bench.warmup=2`, `bench.iterations=5`).

Removing the per-cell lookup (CPU, JMH, ops/us — higher is better)
The hot-loop micro-opt replaces the per-cell HashMap.get(column) (which hashes the value-based
Column record on every cell) with a positional array read. dispatchRowIndexed runs the same
16-column row as dispatchRow, so the delta isolates exactly the lookup that was removed — the rest
is the unavoidable generate() work, which dominates a row containing a 256-char varchar and a jsonb
payload.
| Benchmark | Score (ops/us) | Notes |
|---|---|---|
| `RowDispatchBenchmark.dispatchRow` (HashMap) | 0.281 | 16-column row, per-cell map lookup |
| `RowDispatchBenchmark.dispatchRowIndexed` (array) | 0.328 | same row, positional dispatch (+16.7%) |
(GeneratorBenchmark measures raw per-type generation and is unaffected by the dispatch change;
slowest types in the baseline run: VARCHAR_LONG 0.59, JSONB 2.2, UUID 8.7, NUMERIC 8.0 ops/us.)
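
The difference between the two dispatch styles measured above can be sketched as follows. This is an illustration only, not `TableFiller`'s code: the `Column` record, the `Supplier`-based generators, and the method names are hypothetical stand-ins. The point is that the map variant hashes a value-based record for every cell, while the indexed variant resolves each generator once and then reads it by position.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch contrasting per-cell map lookup with positional dispatch. */
final class DispatchSketch {

    /** Value-based key: hashCode() covers every component, so each lookup hashes both strings. */
    record Column(String table, String name) {}

    /** Per-cell lookup: one HashMap.get per cell, for every row. */
    static Object[] rowByMap(List<Column> columns, Map<Column, Supplier<?>> generatorMap) {
        Object[] row = new Object[columns.size()];
        for (int i = 0; i < row.length; i++) {
            row[i] = generatorMap.get(columns.get(i)).get();
        }
        return row;
    }

    /** Done once per table: resolve every generator into an array ordered like the columns. */
    static Supplier<?>[] indexGenerators(List<Column> columns, Map<Column, Supplier<?>> generatorMap) {
        Supplier<?>[] byPosition = new Supplier<?>[columns.size()];
        for (int i = 0; i < byPosition.length; i++) {
            byPosition[i] = generatorMap.get(columns.get(i));
        }
        return byPosition;
    }

    /** Positional dispatch: a plain array read per cell, no hashing in the row loop. */
    static Object[] rowByIndex(Supplier<?>[] byPosition) {
        Object[] row = new Object[byPosition.length];
        for (int i = 0; i < row.length; i++) {
            row[i] = byPosition[i].get();
        }
        return row;
    }

    static List<Column> demoColumns() {
        return List.of(new Column("t", "id"), new Column("t", "name"), new Column("t", "active"));
    }

    static Map<Column, Supplier<?>> demoGenerators() {
        Map<Column, Supplier<?>> generators = new LinkedHashMap<>();
        generators.put(new Column("t", "id"), () -> 1L);
        generators.put(new Column("t", "name"), () -> "abc");
        generators.put(new Column("t", "active"), () -> true);
        return generators;
    }

    public static void main(String[] args) {
        List<Column> columns = demoColumns();
        Map<Column, Supplier<?>> generators = demoGenerators();
        System.out.println(Arrays.toString(rowByMap(columns, generators)));
        System.out.println(Arrays.toString(rowByIndex(indexGenerators(columns, generators))));
    }
}
```

Both variants produce the same row; only the cost of finding the generator changes, which is why the benefit is visible in a CPU micro-benchmark and fades once generation and I/O dominate.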
Stacking the strategies on one large table
A single 1,000,000-row table (postgres/single) is the clearest place to watch the optimizations
compound: it sits alone in its topological level, so parallel table fill can’t touch it, yet it
benefits from every other lever. Starting from a naive fill (sequential, autocommit, no driver batch
rewrite) and adding one strategy at a time — best of 5:
| Step | Added | rows/sec | Cumulative |
|---|---|---|---|
| Naive | sequential, autocommit, no batch rewrite | 125,228 | 1.00× |
| + driver batch rewrite | reWriteBatchedInserts collapses each batch into one multi-row INSERT | 151,641 | 1.21× |
| + per-table commit | one commit instead of one per batch | 160,867 | 1.28× |
| + intra-table partitioning ×8 | the table’s rows split across 8 workers | 760,959 | 6.08× |

```mermaid
xychart-beta
  title "postgres/single: cumulative fill throughput, 1M rows (rows/sec, higher is better)"
  x-axis ["naive", "+batch rewrite", "+per-table commit", "+partitions x8"]
  y-axis "rows / sec" 0 --> 800000
  bar [125228, 151641, 160867, 760959]
```

Together these take a single large table from 125k to 761k rows/sec — a 6.08× speedup. Driver batch rewrite and intra-table partitioning do the heavy lifting; per-table commit is a small bump here — with batch rewrite on, a single table already commits cheaply — but it matters more on the many-table schemas below. Each step is reproducible from the harness knobs:

```bash
# A function (rather than a string variable) keeps the quoted -Dtest pattern intact in bash and zsh.
bench() {
  ./mvnw -pl bloviate-benchmarks -am -Pbench test \
    -Dtest='PostgresFillBenchmark#singleTable' -Dbench.rows=1000000 "$@"
}
bench -Dbench.rewriteBatched=false -Dbench.commit=none     -Dbench.threads=1                       # naive
bench -Dbench.rewriteBatched=true  -Dbench.commit=none     -Dbench.threads=1                       # + batch rewrite
bench -Dbench.rewriteBatched=true  -Dbench.commit=perTable -Dbench.threads=1                       # + per-table commit
bench -Dbench.rewriteBatched=true  -Dbench.commit=perTable -Dbench.threads=8 -Dbench.partitions=8  # + partitioning
```

Parallel table fill across schemas (best of 3)
Where a schema has many independent tables, parallel table fill is the lever instead of intra-table
partitioning. Baseline is the sequential single-Connection fill; “parallel” is eight workers, one
connection each:
| Scenario | Rows | Baseline (rows/sec) | Parallel, 8 workers (rows/sec) | Speedup |
|---|---|---|---|---|
| `postgres/wide` | 1,000,000 | 126,984 | 414,126 | 3.26× |
| `postgres/tpcc` | 61,983 | 75,520 | 83,694 | 1.11× |
| `mysql/tpcc` | 61,983 | 40,043 | 57,973 | 1.45× |
| `cockroach/tpcc` | 61,983 | 12,403 | 13,304 | 1.07× |

```mermaid
xychart-beta
  title "Parallel table fill speedup vs baseline (x, higher is better)"
  x-axis ["postgres/wide", "mysql/tpcc", "postgres/tpcc", "cockroach/tpcc"]
  y-axis "speedup (x)" 0 --> 3.5
  bar [3.26, 1.45, 1.11, 1.07]
```

Reading the numbers
- `wide` is the headline for parallel table fill. Ten independent, FK-free tables sit in a single topological level, so eight workers fill them concurrently: 3.26× over the single-connection baseline. This is the case the parallel optimization targets.
- TPC-C gains are modest by design. Its foreign keys form a deep, narrow dependency graph (most levels hold only one or two independent tables), so there is little to parallelize within a level; the barrier between levels caps the win. The improvement that remains comes mostly from committing once per table instead of per batch.
- CockroachDB is bound by Raft/commit latency, not client CPU, so neither change moves it much.
- The sequential micro-opt is within end-to-end noise. A real fill is dominated by value generation and JDBC round-trips, not the per-cell lookup, so the `dispatchRowIndexed` win shows up clearly in JMH but not end-to-end. The optimization is still worth keeping: it is free and removes allocation/lookup from the hottest loop.

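
The contrast between `wide` and TPC-C comes down to how many tables share a topological level. The sketch below groups tables into levels from their foreign-key references; it is a generic level-by-level topological sort, not Bloviate's implementation. The `TopoLevels` class is hypothetical, and the TPC-C-like reference map is a simplified illustration that may differ from the schema the benchmarks use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: group tables into levels that can each be filled in parallel. */
final class TopoLevels {

    /**
     * `references` maps each table to the tables it points at via foreign keys.
     * Every referenced table must also appear as a key. Tables within a level are sorted by name.
     */
    static List<List<String>> levels(Map<String, Set<String>> references) {
        Map<String, Set<String>> remaining = new TreeMap<>();
        references.forEach((table, parents) -> remaining.put(table, new TreeSet<>(parents)));
        List<List<String>> levels = new ArrayList<>();
        while (!remaining.isEmpty()) {
            List<String> ready = new ArrayList<>();
            for (Map.Entry<String, Set<String>> entry : remaining.entrySet()) {
                if (entry.getValue().isEmpty()) {
                    ready.add(entry.getKey()); // all parents already filled
                }
            }
            if (ready.isEmpty()) {
                throw new IllegalStateException("cycle or unknown table among " + remaining.keySet());
            }
            for (String table : ready) {
                remaining.remove(table);
            }
            for (Set<String> parents : remaining.values()) {
                parents.removeAll(ready);
            }
            levels.add(ready);
        }
        return levels;
    }

    /** Simplified, illustrative TPC-C-like foreign-key graph (an assumption, not the benchmark schema). */
    static Map<String, Set<String>> tpccLike() {
        Map<String, Set<String>> refs = new LinkedHashMap<>();
        refs.put("warehouse", Set.of());
        refs.put("item", Set.of());
        refs.put("district", Set.of("warehouse"));
        refs.put("stock", Set.of("warehouse", "item"));
        refs.put("customer", Set.of("district"));
        refs.put("history", Set.of("customer", "district"));
        refs.put("orders", Set.of("customer"));
        refs.put("new_order", Set.of("orders"));
        refs.put("order_line", Set.of("orders", "stock"));
        return refs;
    }

    /** `count` tables with no foreign keys, like the wide schema. */
    static Map<String, Set<String>> wide(int count) {
        Map<String, Set<String>> refs = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            refs.put("wide_" + i, Set.of());
        }
        return refs;
    }

    public static void main(String[] args) {
        System.out.println("tpcc-like: " + levels(tpccLike()));
        System.out.println("wide:      " + levels(wide(10)));
    }
}
```

With this illustrative graph, the TPC-C-like schema yields five levels of at most two tables each, while ten FK-free tables form a single level of ten, which is the shape that lets eight workers stay busy.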
Note on reproducibility: the same config and seed produce identical data on every run, including
date/time/timestamp columns. PostgresParallelFillTest asserts a parallel fill reproduces the
sequential fill byte-for-byte across a TPC-C schema, comparing every column except the few that use
wall-clock time by design (e.g. the order-line delivery date), which are intentionally
non-deterministic. See Reproducibility — deterministic seeds from schema identity.
Switching the RNG
Bloviate generates values with a java.util.random.RandomGenerator (the JDK general-purpose default
L64X128MixRandom, created via RandomGenerators.create(seed)) rather than the legacy java.util.Random,
a 48-bit LCG with a documented statistical defect and thread-safe, atomically updated state. The payoff is twofold:
faster draws (no per-draw synchronization, a better algorithm) and higher statistical quality, with no new dependency. The
per-column seeding architecture is unchanged, so output stays reproducible; only the algorithm
changes. The numbers below compare the two RNGs on an identical benchmark harness.
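
As a rough illustration of the switch, the sketch below creates a seeded `L64X128MixRandom` through the JDK's `RandomGeneratorFactory` and shows that two generators built from the same seed produce the same sequence. It is not Bloviate's `RandomGenerators` helper: the `SeededRngSketch` class is hypothetical, and the `columnSeed` mixing function is an invented example of deriving a per-column seed, not the library's actual scheme.

```java
import java.util.random.RandomGenerator;
import java.util.random.RandomGeneratorFactory;

/** Hypothetical sketch of seeded L64X128MixRandom generators; not Bloviate's own helper. */
final class SeededRngSketch {

    /** A deterministic generator: the same seed always yields the same sequence. */
    static RandomGenerator create(long seed) {
        return RandomGeneratorFactory.of("L64X128MixRandom").create(seed);
    }

    /**
     * Invented example of a per-column seed: mix the base seed with the column's identity.
     * String.hashCode() is specified by the JDK, so the result is stable across runs and machines.
     */
    static long columnSeed(long baseSeed, String table, String column) {
        long identity = (table + "." + column).hashCode();
        return baseSeed ^ (identity * 0x9E3779B97F4A7C15L);
    }

    public static void main(String[] args) {
        RandomGenerator first = create(columnSeed(42L, "orders", "o_id"));
        RandomGenerator second = create(columnSeed(42L, "orders", "o_id"));
        for (int i = 0; i < 3; i++) {
            // Each line prints the same value twice, because both generators share a seed.
            System.out.println(first.nextLong() + " " + second.nextLong());
        }
    }
}
```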
Environment
- Machine / CPU: Apple M5 (Mac17,3), 10 cores
- JDK: Temurin 25.0.3+9 LTS
- Harness: `GeneratorBenchmark` / `RowDispatchBenchmark`, JMH `-f 1 -wi 3 -w 1 -i 5 -r 1` (quick settings; error bars on the clean `generate()` rows are <1%). Higher is better.

Raw value generation (CPU, JMH, ops/us — higher is better)
Raw value generation, old vs new RNG:
| `generate()` | `java.util.Random` | `L64X128MixRandom` | Δ |
|---|---|---|---|
| INTEGER | 180.8 | 318.7 | +76% |
| BIGINT | 91.4 | 320.5 | 3.5× |
| DOUBLE | 91.6 | 344.9 | 3.8× |
| VARCHAR_SHORT (16) | 9.5 | 17.2 | +80% |
| VARCHAR_LONG (256) | 0.59 | 1.32 | 2.2× |
| UUID | 9.4 | 9.8 | +4% |
Integer/long/double draws benefit most — no lock and a better algorithm. UUID is dominated by the
16-byte nextBytes copy, so it barely moves. Strings improved too: SeededRandomUtils no longer
delegates to commons-lang3 RandomStringUtils (which only accepts a java.util.Random); the
alphabetic/numeric paths now draw directly from a fixed character pool, which removes the
draw-and-reject loop entirely.
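
The fixed-pool approach for strings can be sketched as below: every random index maps straight to a character in the pool, so no generated character is ever discarded at the string level. This is an illustration under assumptions, not the code in `SeededRandomUtils`; the `PoolStrings` class and its pool contents are hypothetical.

```java
import java.util.random.RandomGenerator;
import java.util.random.RandomGeneratorFactory;

/** Hypothetical sketch of drawing alphabetic strings from a fixed character pool. */
final class PoolStrings {

    private static final char[] ALPHABETIC =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".toCharArray();

    /** Builds a string of `length` letters; each draw indexes directly into the pool. */
    static String alphabetic(RandomGenerator random, int length) {
        char[] out = new char[length];
        for (int i = 0; i < length; i++) {
            out[i] = ALPHABETIC[random.nextInt(ALPHABETIC.length)];
        }
        return new String(out);
    }

    public static void main(String[] args) {
        RandomGenerator random = RandomGeneratorFactory.of("L64X128MixRandom").create(42L);
        System.out.println(alphabetic(random, 16));
    }
}
```

Because the method accepts any `RandomGenerator`, it works with the new algorithm directly, which a utility tied to `java.util.Random` could not.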
Per-row dispatch (CPU, JMH, ops/us — higher is better)
End-to-end per-row dispatch: `RowDispatchBenchmark` over a 16-column row, the real inner loop of
TableFiller.fill():
| Benchmark | java.util.Random | L64X128MixRandom | Δ |
|---|---|---|---|
| `dispatchRow` (HashMap) | 0.313 | 0.607 | +94% |
| `dispatchRowIndexed` (array) | 0.327 | 0.661 | +102% |

```mermaid
xychart-beta
  title "Per-row generate() dispatch, 16-column row (ops/us, higher is better)"
  x-axis ["Random (HashMap)", "L64X128 (HashMap)", "Random (array)", "L64X128 (array)"]
  y-axis "ops / us" 0 --> 0.7
  bar [0.313, 0.607, 0.327, 0.661]
```

Bottom line: ~2× faster per-row generation with better statistical quality, no new dependency, and unchanged reproducibility.