Methodology

Overview

Each server is benchmarked in isolation: one server running at a time, on the same host as the k6 load generator, using loopback only — no network latency. All servers are loaded with the same terminology datasets before benchmarking begins.

Servers

All servers run as Docker containers, either from their official published image or containerised for this benchmark. Containers are named after their server identifier so that cAdvisor metrics can be correlated with k6 results by container name.

For per-server configuration (version, runtime, and any non-default settings) see: Servers

If a server crashes between tests, it is restarted and the benchmark continues from the next test (or VU level).

Data

All servers are pre-loaded with the same terminology data before benchmarking:

  • SNOMED CT International, UK, and US editions
  • LOINC
  • RxNorm
  • FHIR packages: hl7.fhir.r4.core, hl7.terminology.r4, hl7.fhir.us.vsac, hl7.fhir.uv.ips, and others

See the Data page for dataset versions, licenses, and load instructions.

Tests

The benchmark covers 20 test cases. Each test draws its inputs at random from a per-test pool of entries, simulating real-world access patterns rather than hot-cache hits on a single code.
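Drawing inputs at random from a pool amounts to the following minimal sketch. The pool contents here are illustrative examples, not the benchmark's actual datasets:

```javascript
// Hypothetical input pool for a code-lookup test; the real pools are
// built from the loaded terminology data, not hard-coded like this.
const pool = [
  { system: 'http://snomed.info/sct', code: '22298006' },
  { system: 'http://loinc.org', code: '718-7' },
  { system: 'http://www.nlm.nih.gov/research/umls/rxnorm', code: '1049221' },
];

// Uniform random draw, so repeated iterations spread across the pool
// instead of hammering one hot-cache entry.
function pickRandom(entries) {
  return entries[Math.floor(Math.random() * entries.length)];
}

const entry = pickRandom(pool);
console.log(pool.includes(entry)); // true
```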

See the Tests page for the full list.

Preflight

Before benchmarking, each test is run once against the server with a known input. The response is checked for semantic correctness — right resource type, expected fields, correct values.
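A preflight check of this kind might look like the sketch below. The response body is stubbed rather than fetched, and the specific fields checked are assumptions about a CodeSystem/$lookup-style Parameters response, not the benchmark's actual assertions:

```javascript
// Stubbed response body for a CodeSystem/$lookup call; in the real
// preflight this would come from an HTTP request to the server.
const response = {
  resourceType: 'Parameters',
  parameter: [
    { name: 'name', valueString: 'SNOMED CT' },
    { name: 'display', valueString: 'Asthma' },
  ],
};

// Semantic checks: right resource type, expected fields, correct values.
function preflightOk(body) {
  if (body.resourceType !== 'Parameters') return false;
  const display = (body.parameter || []).find((p) => p.name === 'display');
  return display !== undefined && display.valueString === 'Asthma';
}

console.log(preflightOk(response)); // true
```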

Tests that fail this check are excluded from that server’s benchmark run and recorded in the capability matrix. Excluded tests are handled via imputation in the scoring step rather than being counted as zero.

Benchmark

Only preflight-passing tests are benchmarked. Each test is run at three concurrency levels using k6:

Level   Virtual users   Duration
Low     1 VU            30 s
Mid     10 VUs          30 s
High    50 VUs          30 s

A warm-up pass (5 s at 10 VUs, results discarded) runs before measurement to allow JIT compilers and connection pools to reach steady state.
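In k6 terms, the warm-up pass plus the three levels can be expressed as constant-vus scenarios chained with startTime offsets. The scenario names and offsets below are assumptions for illustration, not the benchmark's actual configuration:

```javascript
// k6 `options` sketch: one warm-up scenario plus the three measured
// levels, chained via startTime so only one scenario runs at a time.
// Warm-up results would be discarded (e.g. filtered out by scenario tag).
const options = {
  scenarios: {
    warmup: { executor: 'constant-vus', vus: 10, duration: '5s' },
    low:    { executor: 'constant-vus', vus: 1,  duration: '30s', startTime: '5s' },
    mid:    { executor: 'constant-vus', vus: 10, duration: '30s', startTime: '35s' },
    high:   { executor: 'constant-vus', vus: 50, duration: '30s', startTime: '65s' },
  },
};

console.log(Object.keys(options.scenarios).join(','));
// warmup,low,mid,high
```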

Each VU runs a tight request loop for the duration: pick a random entry from the pool, issue the request, record the result. There is no think time between requests.
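The inner loop each VU runs is essentially the following sketch. `issueRequest` is a stand-in for the k6 HTTP call, stubbed here so the loop's shape is visible without a running server, and a fixed iteration count stands in for "until the 30 s duration elapses":

```javascript
const pool = ['22298006', '404684003', '73211009']; // illustrative codes

// Stand-in for k6's http.get/post; the real loop records k6 metrics.
function issueRequest(code) {
  return { code, status: 200 };
}

const results = [];
// Three iterations stand in for looping until the duration elapses.
for (let i = 0; i < 3; i++) {
  const code = pool[Math.floor(Math.random() * pool.length)];
  results.push(issueRequest(code)); // no think time between requests
}

console.log(results.length); // 3
```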

Metrics collection

Latency and throughput are measured by k6 directly: http_req_duration percentiles (p50, p95, p99) and requests per second.
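k6 computes these percentiles itself; as a reference for how a pXX over http_req_duration samples is derived, here is a minimal nearest-rank sketch (the exact method k6 uses internally may differ):

```javascript
// Nearest-rank percentile over a list of request durations (ms).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const durations = [12, 15, 11, 90, 14, 13, 200, 16, 12, 18];
console.log(percentile(durations, 50)); // 14
console.log(percentile(durations, 95)); // 200
```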

Memory is measured via cAdvisor, which exposes per-container metrics at 1-second resolution. Two snapshots are recorded per server:

  • Idle — taken before the benchmark run starts, after the server has finished loading data
  • Peak — max_over_time of container_memory_usage_bytes across the full benchmark run
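The Peak snapshot is just the maximum of the sampled series over the benchmark window, as this sketch shows; the sample values are made up, and the PromQL label in the comment assumes cAdvisor's standard `name` label:

```javascript
// Simulated 1 s samples of container_memory_usage_bytes for one container.
// Equivalent PromQL (assuming cAdvisor's `name` label holds the container
// name): max_over_time(container_memory_usage_bytes{name="server-a"}[2m])
const samples = [512e6, 530e6, 610e6, 598e6, 575e6];

// max_over_time over the full window: the largest sample seen.
const peakBytes = samples.reduce((max, v) => Math.max(max, v), 0);
console.log(peakBytes); // 610000000
```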

Container names match the server identifier so cAdvisor metrics can be joined with k6 results without manual mapping.
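Because the names line up, the join is a plain key match, as in this sketch. The record shapes and field names are assumptions for illustration:

```javascript
// Hypothetical summarised outputs from k6 and cAdvisor for one server.
const k6Results = [{ server: 'server-a', p95Ms: 42 }];
const memStats  = [{ container: 'server-a', peakBytes: 610e6 }];

// Join k6 results with cAdvisor metrics on the shared identifier.
const merged = k6Results.map((r) => ({
  ...r,
  ...memStats.find((m) => m.container === r.server),
}));

console.log(merged[0].p95Ms, merged[0].peakBytes); // 42 610000000
```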

See the Scoring page for how these measurements are combined into a composite score.