Introduction
What is this?
The FHIR TX Benchmark is an open, reproducible performance benchmark for FHIR terminology servers — servers that implement the FHIR $validate-code, $expand, $lookup, $translate, and related operations.
It measures throughput and latency across a standardised set of test cases, run on the same hardware, against the same loaded terminology data, in isolation from each other. This makes results comparable across servers and across time.
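Each of these operations is an ordinary HTTP call against the server's FHIR endpoint. As a minimal sketch (the base URL here is hypothetical, not a benchmark target), a $lookup request for a LOINC code can be assembled like this:

```python
from urllib.parse import urlencode

BASE = "http://tx.example.org/fhir"  # hypothetical server base URL


def lookup_url(system: str, code: str) -> str:
    """Build a GET URL for the CodeSystem/$lookup operation."""
    return f"{BASE}/CodeSystem/$lookup?{urlencode({'system': system, 'code': code})}"


# 718-7 is the LOINC code for "Hemoglobin [Mass/volume] in Blood"
print(lookup_url("http://loinc.org", "718-7"))
```

The benchmark issues many such requests per test and records throughput and latency for each.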
Why it matters
Terminology operations are on the critical path of almost every clinical data pipeline — validating codes on intake, expanding value sets for clinical decision support (CDS), translating between code systems. A slow or unreliable terminology server degrades the entire stack.
Published vendor benchmarks are rare, methodology is usually opaque, and “works on my machine” comparisons are not reproducible. This benchmark exists to provide a neutral, open baseline.
What is tested
The tests here are selected for their performance interest — text search, large expansions, multiple filters, diverse code system sources — not for coverage breadth. This is not a conformance test; a server may excel at coverage and correctness across the full FHIR terminology API and still rank low here. For broad conformance testing, see the FHIR TX Ecosystem.
Tests cover the main FHIR terminology operations across the most widely used code systems in clinical data:
| Category | Operations |
|---|---|
| LK — Lookup | $lookup for SNOMED CT, LOINC, RxNorm, ICD-10 |
| VC — Validate code | $validate-code against common value sets |
| EX — Expand | $expand for small, medium, and large value sets |
| SS — Subsumption | $subsumes for hierarchies |
| CM — Concept map | $translate using standard cross-maps |
| FS — FHIR search | GET /CodeSystem, GET /ValueSet with filters |
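As an illustration of the EX category, an $expand request with a text filter and page size can be built as follows. The base URL and parameter values are illustrative, not the benchmark's actual inputs; the value set URL is the standard SNOMED CT implicit value set.

```python
from typing import Optional
from urllib.parse import urlencode

BASE = "http://tx.example.org/fhir"  # hypothetical server base URL


def expand_url(value_set: str, text_filter: Optional[str] = None, count: int = 100) -> str:
    """Build a GET URL for ValueSet/$expand, optionally narrowed by a
    text filter and capped at `count` codes per page."""
    params = {"url": value_set, "count": count}
    if text_filter is not None:
        params["filter"] = text_filter
    return f"{BASE}/ValueSet/$expand?{urlencode(params)}"


print(expand_url("http://snomed.info/sct?fhir_vs", text_filter="diabetes"))
```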
Each test draws inputs randomly from a pool of 2,000+ entries to simulate realistic working-set conditions rather than testing the same code in a hot cache.
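That sampling can be sketched as below, with a made-up pool (the real pools hold prepared codes, value-set URLs, and filter strings for each test):

```python
import random

# Hypothetical pool; the benchmark's pools hold 2,000+ prepared entries per test.
pool = [f"code-{i}" for i in range(2000)]


def next_input(rng: random.Random) -> str:
    """Draw a uniformly random entry so successive requests rarely repeat,
    which keeps a small response cache from answering everything."""
    return rng.choice(pool)


rng = random.Random(7)  # seeded for reproducible runs
samples = [next_input(rng) for _ in range(100)]
```

With 100 draws from 2,000 entries, nearly every request hits a different code, so results reflect real lookup work rather than cache hits.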
How to read results
Composite Score — A single 0–100 number summarising overall performance. The top server in each run scores 100; all others are a percentage of that. See the Scoring page for the formula.
wRPS — Weighted requests per second. Raw throughput adjusted so that naturally fast tests (simple lookups) don’t drown out naturally slow ones (large expansions) in the score.
Preflight dots — Each coloured dot represents one test case. Green = passed the correctness check and was benchmarked. Red = returned an incorrect response (excluded from scoring). Grey = not supported or returned 4xx/5xx.
Error rate — Fraction of benchmark requests that returned an HTTP error during load testing. Yellow = below 1%. Red = 1% or above.
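Putting the two scoring ideas together, a simplified sketch follows. The weight values and the aggregation are placeholders to show the shape of the calculation; the actual formula is on the Scoring page.

```python
def wrps(rps_by_test: dict, weights: dict) -> float:
    """Weighted requests per second: scale each test's raw throughput so
    fast tests (simple lookups) don't swamp slow ones (large expansions)."""
    return sum(weights[test] * rps for test, rps in rps_by_test.items())


def composite_scores(wrps_by_server: dict) -> dict:
    """Normalise to 0-100: the top server in the run scores 100,
    every other server a percentage of it."""
    top = max(wrps_by_server.values())
    return {server: 100.0 * value / top for server, value in wrps_by_server.items()}


# Illustrative numbers only, not real benchmark results.
scores = composite_scores({"server-a": 840.0, "server-b": 420.0})
```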
Limitations
- Not a production proxy. The benchmark runs on loopback with no network, no auth, and a single client machine. Real-world performance depends heavily on network topology, hardware, caching strategies, and data volume.
- Snapshot in time. Results reflect a specific software version and dataset snapshot. Servers improve; check the run date.
- Coverage gaps. Not all FHIR operations and code systems are covered. Servers that support operations outside the test suite are not rewarded for that coverage. See the FHIR TX Ecosystem for broad conformance testing.
- Single node. No clustering or horizontal scaling is tested.