
Benchmarks

Bloviate ships with two complementary performance suites: CPU micro-benchmarks that measure raw value generation in isolation (via JMH), and end-to-end fills that time DatabaseFiller.fill() against real databases running in containers. The CPU suite isolates the cost of generating values and dispatching them per cell; the end-to-end suite captures everything a real run pays for — metadata fetch, generation, batching, and JDBC round-trips. Every number on this page is reproducible: the benchmarks live in the (never-published) bloviate-benchmarks module, use fixed seeds so each iteration generates identical data, and record the exact hardware, JDK, and database image they were measured on.

The work being optimized spans two regimes, so there are two suites:

| Suite | What it measures | Where the wins show up | Tool |
| --- | --- | --- | --- |
| CPU micro-benchmarks | raw value generation and per-cell generator dispatch, no database | hot-loop micro-opts, generator-level changes | JMH |
| End-to-end fill | DatabaseFiller.fill() throughput (rows/sec) against a real DB | parallel table fill, commit tuning, batch rewrite | plain JUnit runner |

The bloviate-benchmarks module isn’t published to Maven Central and nothing in bloviate-core depends on it — running the benchmarks is entirely opt-in and adds no weight to the library.

Pure-CPU, no Docker. The benchmarks resolve generators exactly the way TableFiller does (DatabaseSupport.getDataGenerator(column, random)), so the numbers reflect real engine cost.

  • GeneratorBenchmark — throughput of generate() / generateAsString() across a spread of column types (int, numeric, varchar, timestamp, uuid, jsonb, …).
  • RowDispatchBenchmark — models the inner loop of TableFiller.fill(): the per-cell generatorMap.get(column) HashMap lookup plus generate() over a wide row. This is the baseline for the “index generators by array position” change.

Build the runnable uber-jar and run it:

./mvnw -q -DskipTests -pl bloviate-benchmarks -am package
java -jar bloviate-benchmarks/target/benchmarks.jar

Useful invocations (standard JMH CLI — flags take a space):

# one benchmark, quick
java -jar bloviate-benchmarks/target/benchmarks.jar RowDispatchBenchmark
# restrict GeneratorBenchmark to specific column types
java -jar bloviate-benchmarks/target/benchmarks.jar GeneratorBenchmark.generate \
-p genCase=UUID,VARCHAR_SHORT
# fewer forks/iterations while iterating locally
java -jar bloviate-benchmarks/target/benchmarks.jar -f 1 -wi 2 -i 3

Opt-in JUnit runners tagged @Tag("benchmark"), one per database, reusing the same Testcontainers stack as the core tests. They are skipped by a normal build and run only under the bench profile. They require Docker (OrbStack works).

Each measured iteration truncates the schema, times a full DatabaseFiller.fill() (metadata fetch included), and prints rows/sec. A fixed seed makes every iteration generate identical data, so timings are comparable across runs and across optimization branches.
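The measurement loop described above can be sketched in plain Java. This is a minimal illustration, not the harness's own code: the class and method names are hypothetical, the `reset` step stands in for truncating the schema (assumed here to be untimed), and the `fill` step stands in for DatabaseFiller.fill().

```java
import java.util.Locale;

/** Hypothetical timing loop; names are illustrative, not the harness's own. */
final class FillTimingSketch {

    /** Rows per second for a fill that wrote {@code rows} rows in {@code nanos} ns. */
    static double rowsPerSecond(long rows, long nanos) {
        return rows / (nanos / 1_000_000_000.0);
    }

    /** Formats one report line in the spirit of the harness output. */
    static String report(String scenario, String label, long rows, long nanos) {
        return String.format(Locale.US, "[bench] %s %s %,d rows %.3f s %,.0f rows/s",
                scenario, label, rows, nanos / 1_000_000_000.0, rowsPerSecond(rows, nanos));
    }

    /** Runs untimed warmups, then timed iterations; returns {bestNanos, meanNanos}. */
    static long[] measure(int warmup, int iterations, Runnable reset, Runnable fill) {
        for (int i = 0; i < warmup; i++) {
            reset.run();
            fill.run();
        }
        long best = Long.MAX_VALUE;
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            reset.run();                          // e.g. truncate the schema (assumed untimed)
            long start = System.nanoTime();
            fill.run();                           // stands in for DatabaseFiller.fill()
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed);
            total += elapsed;
        }
        return new long[] {best, total / iterations};
    }

    /** Placeholder workload so the sketch runs without a database. */
    private static long busyWork(long rows) {
        long sum = 0;
        for (long i = 0; i < rows; i++) {
            sum += i * 31;
        }
        return sum;
    }

    public static void main(String[] args) {
        long rows = 200_000;
        long[] result = measure(1, 3, () -> { }, () -> busyWork(rows));
        System.out.println(report("demo/sketch", "best", rows, result[0]));
        System.out.println(report("demo/sketch", "mean", rows, result[1]));
    }
}
```

Reporting both best and mean is a common choice for this kind of loop: the best run approximates the least-disturbed measurement, while the mean shows how noisy the machine was.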

  • PostgresFillBenchmark: TPC-C schema, a deliberately wide FK-free schema (create_wide.postgres.sql, the between-table parallel-fill target), and a single dominant table (create_single.postgres.sql, the intra-table partitioning target).
  • MySqlFillBenchmark, CockroachFillBenchmark: TPC-C schema.

# all end-to-end benchmarks (Postgres + MySQL + CockroachDB)
./mvnw -pl bloviate-benchmarks -am -Pbench test
# just Postgres, just the wide schema, larger dataset
./mvnw -pl bloviate-benchmarks -am -Pbench test \
-Dtest='PostgresFillBenchmark#wide' -Dbench.rows=500000
# intra-table partitioning: one big table, sequential baseline vs 8-way partitioned
./mvnw -pl bloviate-benchmarks -am -Pbench test \
-Dtest='PostgresFillBenchmark#singleTable' -Dbench.rows=2000000 -Dbench.threads=1
./mvnw -pl bloviate-benchmarks -am -Pbench test \
-Dtest='PostgresFillBenchmark#singleTable' -Dbench.rows=2000000 -Dbench.threads=8

Output lines look like:

[bench] postgres/wide iteration 1 500,000 rows 3.612 s 138,430 rows/s
[bench] postgres/wide best 500,000 rows 3.580 s 139,664 rows/s
[bench] postgres/wide mean 500,000 rows 3.601 s 138,850 rows/s

Harness properties (set with -D):

| Property | Default | Meaning |
| --- | --- | --- |
| bench.warmup | 1 | untimed warmup fills (JIT / pool / cache priming) |
| bench.iterations | 3 | timed fills; best and mean are reported |
| bench.batch | 1000 | JDBC batch size (DatabaseConfiguration.batchSize) |
| bench.threads | 1 | worker threads; 1 = sequential baseline, >1 = parallel DataSource path |
| bench.partitions | max(2, threads) | intra-table partitions for the single schema |
| bench.rewriteBatched | true | Postgres driver batch rewrite; set false for the naive baseline |
| bench.commit | none | commit strategy for single: none, perTable, or everyN:K |
| bench.rows | 50000 | default row count for the wide/single schema (per table) |
| bench.warehouses | 1 | TPC-C scale factor (W) |
| bench.items | 10000 | TPC-C items (I) |
| bench.districts | 10 | TPC-C districts per warehouse (D) |
| bench.customers | 300 | TPC-C customers/orders per district (C) |
| bench.minLines / bench.maxLines | 5 / 15 | TPC-C order lines per order |

The defaults produce a modest (~60k-row TPC-C / ~500k-row wide) dataset that runs quickly; scale them up for a real large-dataset measurement.

Run on a quiet machine and record the environment alongside the numbers — JMH scores and fill throughput are only comparable within the same hardware/JDK/DB image. Capture the machine/CPU, the JDK, and the Docker/DB image next to the figures, exactly as the recorded results below do.

Fill throughput: stacking the optimizations


DatabaseFiller.fill() applies five independent optimizations, each aimed at a different bottleneck: the JDBC wire, transaction overhead, single-table parallelism, cross-table parallelism, and the per-cell hot loop. Because they target different costs, they stack: on one large table they compound to a 6.08× speedup, and on a wide multi-table schema parallel fill alone delivers 3.26×. The two views below quantify that: a cumulative progression on a single dominant table, and a cross-schema comparison of parallel table fill. The levers:

  • driver batch rewrite: reWriteBatchedInserts collapses each JDBC batch into one multi-row INSERT, cutting round-trips to the database.
  • commit strategy: disable autocommit and commit once per table (or every N batches) instead of paying transaction overhead per batch; see Commit strategy.
  • intra-table partitioning: split one large table’s rows across workers. This is the only lever for a single dominant table (one table, one topological level), which parallel table fill cannot speed up. See Intra-table partitioning.
  • parallel table fill: DataSource + threads, one connection per worker; the lever for schemas with many independent tables. See Parallel fill — topological levels.
  • hot-loop micro-opt: positional generator dispatch in TableFiller, which removes a HashMap lookup from the innermost loop (a clear CPU win in isolation, but within end-to-end noise; see the CPU table).
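To make the commit-strategy lever concrete, the following sketch counts how many commits a single table pays under each strategy. It is an illustration under stated assumptions, not Bloviate's implementation: the class and method names are hypothetical, `none` is treated as autocommit with one implicit commit per batch (matching the naive baseline described on this page), and the K in `everyN:K` is read as a number of batches.

```java
/** Hypothetical commit counter; strategy names mirror the bench.commit values. */
final class CommitCountSketch {

    /** Number of JDBC batches needed for {@code rows} rows (ceiling division). */
    static long batches(long rows, int batchSize) {
        return (rows + batchSize - 1) / batchSize;
    }

    /** Commits paid for one table under the given strategy. */
    static long commits(String strategy, long rows, int batchSize) {
        long batches = batches(rows, batchSize);
        if (strategy.equals("none")) {
            return batches;                       // assumed: autocommit, one commit per batch
        }
        if (strategy.equals("perTable")) {
            return 1;                             // a single commit after the last batch
        }
        if (strategy.startsWith("everyN:")) {
            long k = Long.parseLong(strategy.substring("everyN:".length()));
            return (batches + k - 1) / k;         // one commit per K batches, rounded up
        }
        throw new IllegalArgumentException("unknown strategy: " + strategy);
    }

    public static void main(String[] args) {
        long rows = 1_000_000;
        int batchSize = 1000;
        for (String strategy : new String[] {"none", "perTable", "everyN:100"}) {
            System.out.println(strategy + " -> " + commits(strategy, rows, batchSize) + " commits");
        }
    }
}
```

With the harness settings used on this page (1,000,000 rows, batch size 1000) the table is written in 1000 batches, so the strategies differ by up to three orders of magnitude in commit count even though the rows written are identical.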

Environment

  • Machine / CPU: Apple M5 (Mac17,3), 10 cores
  • JDK: Temurin 25.0.3+9 LTS
  • Docker / DB image: OrbStack (Docker 29.4.0); postgres:18-alpine, mysql:9 (Testcontainers default), cockroachdb
  • Harness: bench.batch=1000; bench.rows=1000000 for wide/single, TPC-C at default cardinalities (~62k rows). Cross-schema figures are best of 3 (bench.warmup=1, bench.iterations=3); the single-table progression is best of 5 (bench.warmup=2, bench.iterations=5).

Removing the per-cell lookup (CPU, JMH, ops/us — higher is better)


The hot-loop micro-opt replaces the per-cell HashMap.get(column) (which hashes the value-based Column record on every cell) with a positional array read. dispatchRowIndexed runs the same 16-column row as dispatchRow, so the delta isolates exactly the lookup that was removed — the rest is the unavoidable generate() work, which dominates a row containing a 256-char varchar and a jsonb payload.

| Benchmark | Score (ops/us) | Notes |
| --- | --- | --- |
| RowDispatchBenchmark.dispatchRow (HashMap) | 0.281 | 16-column row, per-cell map lookup |
| RowDispatchBenchmark.dispatchRowIndexed (array) | 0.328 | same row, positional dispatch (+16.7%) |

(GeneratorBenchmark measures raw per-type generation and is unaffected by the dispatch change; slowest types in the baseline run: VARCHAR_LONG 0.59, JSONB 2.2, UUID 8.7, NUMERIC 8.0 ops/us.)
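The difference between the two dispatch styles can be sketched as follows. This is a simplified stand-in, not TableFiller's code: the `Column` record, the `Supplier`-based generators, and the `index` helper are assumptions made for illustration. The point is that the map variant hashes a value-based key on every cell, while the indexed variant resolves generators once and then reads them by position.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical model of map-based versus positional generator dispatch. */
final class DispatchSketch {

    /** Stand-in for a value-based column key; hashing it touches every component. */
    record Column(String table, String name, String type) {}

    /** Map-based dispatch: one hash lookup per cell. */
    static Object[] dispatchRow(List<Column> columns, Map<Column, Supplier<Object>> generatorMap) {
        Object[] row = new Object[columns.size()];
        for (int i = 0; i < row.length; i++) {
            row[i] = generatorMap.get(columns.get(i)).get();
        }
        return row;
    }

    /** Resolves generators once into an array aligned with column order. */
    static Supplier<Object>[] index(List<Column> columns, Map<Column, Supplier<Object>> generatorMap) {
        @SuppressWarnings("unchecked")
        Supplier<Object>[] generators = new Supplier[columns.size()];
        for (int i = 0; i < generators.length; i++) {
            generators[i] = generatorMap.get(columns.get(i));
        }
        return generators;
    }

    /** Positional dispatch: a plain array read per cell, no hashing. */
    static Object[] dispatchRowIndexed(Supplier<Object>[] generators) {
        Object[] row = new Object[generators.length];
        for (int i = 0; i < row.length; i++) {
            row[i] = generators[i].get();
        }
        return row;
    }

    public static void main(String[] args) {
        List<Column> columns = List.of(
                new Column("demo", "id", "int"),
                new Column("demo", "label", "varchar"));
        Map<Column, Supplier<Object>> generatorMap = new HashMap<>();
        generatorMap.put(columns.get(0), () -> 1);
        generatorMap.put(columns.get(1), () -> "abc");

        Object[] viaMap = dispatchRow(columns, generatorMap);
        Object[] viaArray = dispatchRowIndexed(index(columns, generatorMap));
        System.out.println(java.util.Arrays.equals(viaMap, viaArray));
    }
}
```

Both paths call the same generators in the same order, so the produced row is identical; only the per-cell lookup cost differs.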

Stacking the strategies on one large table


A single 1,000,000-row table (postgres/single) is the clearest place to watch the optimizations compound: it sits alone in its topological level, so parallel table fill can’t touch it, yet it benefits from every other lever. Starting from a naive fill (sequential, autocommit, no driver batch rewrite) and adding one strategy at a time — best of 5:

| Step | Added | rows/sec | Cumulative |
| --- | --- | --- | --- |
| Naive | sequential, autocommit, no batch rewrite | 125,228 | 1.00× |
| + driver batch rewrite | reWriteBatchedInserts collapses each batch into one multi-row INSERT | 151,641 | 1.21× |
| + per-table commit | one commit instead of one per batch | 160,867 | 1.28× |
| + intra-table partitioning ×8 | the table’s rows split across 8 workers | 760,959 | 6.08× |

xychart-beta
    title "postgres/single — cumulative fill throughput, 1M rows (rows/sec, higher is better)"
    x-axis ["naive", "+batch rewrite", "+per-table commit", "+partitions x8"]
    y-axis "rows / sec" 0 --> 800000
    bar [125228, 151641, 160867, 760959]

Together these take a single large table from 125k to 761k rows/sec, a 6.08× speedup. Intra-table partitioning does most of the lifting (4.73× over the preceding step) and driver batch rewrite adds 21%. Per-table commit is a small bump here (about 6%) because, with batch rewrite on, a single table already commits cheaply; it matters more on the many-table schemas below. Each step is reproducible from the harness knobs:

# shared invocation as a shell function; per-step flags are appended via "$@"
bench() {
  ./mvnw -pl bloviate-benchmarks -am -Pbench test \
    -Dtest='PostgresFillBenchmark#singleTable' -Dbench.rows=1000000 "$@"
}
bench -Dbench.rewriteBatched=false -Dbench.commit=none -Dbench.threads=1 # naive
bench -Dbench.rewriteBatched=true -Dbench.commit=none -Dbench.threads=1 # + batch rewrite
bench -Dbench.rewriteBatched=true -Dbench.commit=perTable -Dbench.threads=1 # + per-table commit
bench -Dbench.rewriteBatched=true -Dbench.commit=perTable -Dbench.threads=8 -Dbench.partitions=8 # + partitioning
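The last step splits one table's rows across workers. How Bloviate actually divides the rows is not shown on this page; the sketch below assumes the simplest scheme, contiguous half-open ranges of near-equal size, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical row-range splitter for intra-table partitioning. */
final class PartitionSketch {

    /** Half-open row range [start, end) assigned to one worker. */
    record Range(long start, long end) {
        long size() {
            return end - start;
        }
    }

    /** Splits {@code rows} into {@code partitions} contiguous ranges differing by at most one row. */
    static List<Range> split(long rows, int partitions) {
        if (partitions < 1) {
            throw new IllegalArgumentException("partitions must be >= 1");
        }
        List<Range> ranges = new ArrayList<>(partitions);
        long base = rows / partitions;
        long extra = rows % partitions;           // the first 'extra' ranges get one more row
        long start = 0;
        for (int p = 0; p < partitions; p++) {
            long size = base + (p < extra ? 1 : 0);
            ranges.add(new Range(start, start + size));
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (Range range : split(1_000_000, 8)) {
            System.out.println(range + " size=" + range.size());
        }
    }
}
```

For 1,000,000 rows and 8 partitions each worker receives 125,000 rows. Fixed, disjoint ranges also make it straightforward to keep per-range generation deterministic, which matters for the reproducibility guarantees discussed later on this page.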

Parallel table fill across schemas (best of 3)


Where a schema has many independent tables, parallel table fill is the lever instead of intra-table partitioning. Baseline is the sequential single-Connection fill; “parallel” is eight workers, one connection each:

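| Scenario | Rows | Baseline (rows/sec) | Parallel (8) (rows/sec) | Speedup |
| --- | --- | --- | --- | --- |
| postgres/wide | 1,000,000 | 126,984 | 414,126 | 3.26× |
| postgres/tpcc | 61,983 | 75,520 | 83,694 | 1.11× |
| mysql/tpcc | 61,983 | 40,043 | 57,973 | 1.45× |
| cockroach/tpcc | 61,983 | 12,403 | 13,304 | 1.07× |

xychart-beta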
    title "Parallel table fill speedup vs baseline (x, higher is better)"
    x-axis ["postgres/wide", "mysql/tpcc", "postgres/tpcc", "cockroach/tpcc"]
    y-axis "speedup (x)" 0 --> 3.5
    bar [3.26, 1.45, 1.11, 1.07]

Reading the numbers

  • wide is the headline for parallel table fill. Ten independent, FK-free tables sit in a single topological level, so eight workers fill them concurrently: 3.26× over the single-connection baseline. This is the case the parallel optimization targets.
  • TPC-C gains are modest by design. Its foreign keys form a deep, narrow dependency graph — most levels hold only one or two independent tables — so there is little to parallelize within a level; the barrier between levels caps the win. The improvement that remains comes mostly from committing once per table instead of per batch.
  • CockroachDB is bound by Raft/commit latency, not client CPU, so neither change moves it much.
  • The sequential micro-opt is within end-to-end noise. A real fill is dominated by value generation and JDBC round-trips, not the per-cell lookup, so the dispatchRowIndexed win shows up clearly in JMH but not end-to-end. The optimization is still worth keeping: it is free and removes allocation/lookup from the hottest loop.
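The contrast between wide and TPC-C comes down to how many tables share a topological level. The sketch below groups tables into levels from a foreign-key dependency map. It is an illustration only: the class name and the map representation are assumptions, and the example graph is a simplified, TPC-C-like subset rather than the real schema.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical level builder: tables in one level have no unmet foreign keys. */
final class TopologicalLevelsSketch {

    /** {@code dependsOn} maps every table to the tables it references via foreign keys. */
    static List<List<String>> levels(Map<String, Set<String>> dependsOn) {
        Map<String, Set<String>> remaining = new TreeMap<>(dependsOn); // sorted for stable output
        Set<String> filled = new HashSet<>();
        List<List<String>> levels = new ArrayList<>();
        while (!remaining.isEmpty()) {
            List<String> level = new ArrayList<>();
            for (Map.Entry<String, Set<String>> entry : remaining.entrySet()) {
                if (filled.containsAll(entry.getValue())) {
                    level.add(entry.getKey());    // all parents already filled
                }
            }
            if (level.isEmpty()) {
                throw new IllegalStateException("cyclic or unknown dependency among " + remaining.keySet());
            }
            remaining.keySet().removeAll(level);
            filled.addAll(level);                 // barrier: the next level starts after this one
            levels.add(level);
        }
        return levels;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> wide = new LinkedHashMap<>();
        wide.put("wide_1", Set.of());
        wide.put("wide_2", Set.of());
        wide.put("wide_3", Set.of());
        System.out.println("wide: " + levels(wide));

        Map<String, Set<String>> chain = new LinkedHashMap<>();
        chain.put("warehouse", Set.of());
        chain.put("district", Set.of("warehouse"));
        chain.put("customer", Set.of("district"));
        chain.put("orders", Set.of("customer"));
        System.out.println("chain: " + levels(chain));
    }
}
```

FK-free tables collapse into a single level that workers can fill concurrently, while a dependency chain yields one table per level, leaving nothing to parallelize between barriers.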

Note on reproducibility: the same config and seed produce identical data on every run, including date/time/timestamp columns. PostgresParallelFillTest asserts a parallel fill reproduces the sequential fill byte-for-byte across a TPC-C schema, comparing every column except the few that use wall-clock time by design (e.g. the order-line delivery date), which are intentionally non-deterministic. See Reproducibility — deterministic seeds from schema identity.

Bloviate generates values with a java.util.random.RandomGenerator, specifically the JDK's general-purpose L64X128MixRandom (created through RandomGenerators.create(seed)), rather than the legacy java.util.Random. The legacy class is a 48-bit LCG with a documented statistical defect, and it stays thread-safe by updating its seed with an atomic compare-and-set on every draw (nextGaussian() is additionally synchronized). The payoff is twofold: faster draws (no per-draw atomic update, a better algorithm) and higher statistical quality, with no new dependency. The per-column seeding architecture is unchanged, so output stays reproducible; only the algorithm changes. The numbers below compare the two RNGs on an identical benchmark harness.
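For readers unfamiliar with the java.util.random API (Java 17 and later), the sketch below shows one way to obtain a seeded L64X128MixRandom and confirms that a fixed seed reproduces the same draws. The class and method names are hypothetical; Bloviate's own RandomGenerators.create(seed) helper is not shown on this page and may be implemented differently.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.random.RandomGenerator;
import java.util.random.RandomGeneratorFactory;

/** Hypothetical helper showing seeded generator creation with the JDK factory. */
final class RngSketch {

    /** Creates a seeded L64X128MixRandom through the standard factory lookup. */
    static RandomGenerator create(long seed) {
        return RandomGeneratorFactory.of("L64X128MixRandom").create(seed);
    }

    /** Draws {@code n} longs from any RandomGenerator implementation. */
    static long[] draw(RandomGenerator rng, int n) {
        long[] out = new long[n];
        for (int i = 0; i < n; i++) {
            out[i] = rng.nextLong();
        }
        return out;
    }

    public static void main(String[] args) {
        long seed = 42L;
        // Same algorithm and seed: the two sequences are identical.
        System.out.println(Arrays.equals(draw(create(seed), 4), draw(create(seed), 4)));
        // java.util.Random implements RandomGenerator too, so the same code accepts it.
        RandomGenerator legacy = new Random(seed);
        System.out.println(Arrays.toString(draw(legacy, 2)));
    }
}
```

Because both classes implement the RandomGenerator interface, generator code written against the interface can switch algorithms without changing call sites, which is what makes an A/B comparison on one harness possible.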

Environment

  • Machine / CPU: Apple M5 (Mac17,3), 10 cores
  • JDK: Temurin 25.0.3+9 LTS
  • Harness: GeneratorBenchmark / RowDispatchBenchmark, JMH -f 1 -wi 3 -w 1 -i 5 -r 1 (quick settings; error bars on the clean generate() rows are <1%). Higher is better.

Raw value generation (CPU, JMH, ops/us — higher is better)


Raw value generation, old vs new RNG:

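| generate() | java.util.Random (ops/us) | L64X128MixRandom (ops/us) | Δ |
| --- | --- | --- | --- |
| INTEGER | 180.8 | 318.7 | +76% |
| BIGINT | 91.4 | 320.5 | 3.5× |
| DOUBLE | 91.6 | 344.9 | 3.8× |
| VARCHAR_SHORT (16) | 9.5 | 17.2 | +80% |
| VARCHAR_LONG (256) | 0.59 | 1.32 | 2.2× |
| UUID | 9.4 | 9.8 | +4% |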

Integer/long/double draws benefit most: there is no per-draw atomic seed update and the algorithm is better. UUID is dominated by the 16-byte nextBytes copy, so it barely moves. Strings improved too: SeededRandomUtils no longer delegates to commons-lang3 RandomStringUtils (which only accepts a java.util.Random); the alphabetic/numeric paths now draw directly from a fixed character pool, which removes the draw-and-reject loop entirely.
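Drawing from a fixed character pool can be sketched as below. This is not SeededRandomUtils itself: the class name, method names, and pool contents are assumptions for illustration. Each draw is a bounded index into the pool, so every draw yields a usable character and no candidate is ever discarded.

```java
import java.util.random.RandomGenerator;
import java.util.random.RandomGeneratorFactory;

/** Hypothetical pool-based string generator; every draw maps to a valid character. */
final class CharPoolSketch {

    private static final char[] ALPHABETIC =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".toCharArray();

    /** Seeded generator for the demo; any RandomGenerator works. */
    static RandomGenerator seeded(long seed) {
        return RandomGeneratorFactory.of("L64X128MixRandom").create(seed);
    }

    /** Builds a string by indexing the pool with bounded draws. */
    static String randomAlphabetic(RandomGenerator rng, int length) {
        char[] out = new char[length];
        for (int i = 0; i < length; i++) {
            out[i] = ALPHABETIC[rng.nextInt(ALPHABETIC.length)];
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(randomAlphabetic(seeded(7L), 16));
        System.out.println(randomAlphabetic(seeded(7L), 16)); // same seed, same string
    }
}
```

The cost per character is one bounded draw plus an array read, independent of how sparse the wanted characters are within the full character range.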

Per-row dispatch (CPU, JMH, ops/us — higher is better)


Per-row dispatch: RowDispatchBenchmark over a 16-column row, which models the inner loop of TableFiller.fill():

| Benchmark | java.util.Random (ops/us) | L64X128MixRandom (ops/us) | Δ |
| --- | --- | --- | --- |
| dispatchRow (HashMap) | 0.313 | 0.607 | +94% |
| dispatchRowIndexed (array) | 0.327 | 0.661 | +102% |

xychart-beta
    title "Per-row generate() dispatch — 16-column row (ops/us, higher is better)"
    x-axis ["Random (HashMap)", "L64X128 (HashMap)", "Random (array)", "L64X128 (array)"]
    y-axis "ops / us" 0 --> 0.7
    bar [0.313, 0.607, 0.327, 0.661]

Bottom line: ~2× faster per-row generation with better statistical quality, no new dependency, and unchanged reproducibility.