Introduction

What is this?

The FHIR TX Benchmark is an open, reproducible performance benchmark for FHIR terminology servers — servers that implement the FHIR $validate-code, $expand, $lookup, $translate, and related operations.

It measures throughput and latency across a standardised set of test cases. Every server runs on the same hardware, against the same loaded terminology data, and in isolation from the others. This makes results comparable across servers and over time.
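Each benchmark request is an ordinary FHIR REST call. As a minimal sketch, assuming a hypothetical loopback base URL (the real benchmark targets whatever server is under test), a $lookup invocation could be built like this:

```python
from urllib.parse import urlencode

# Hypothetical base URL -- the benchmark runs against the server under
# test on loopback; the port and path here are illustrative.
BASE_URL = "http://localhost:8080/fhir"

def lookup_url(system: str, code: str) -> str:
    """Build a GET URL for the FHIR CodeSystem $lookup operation."""
    params = urlencode({"system": system, "code": code})
    return f"{BASE_URL}/CodeSystem/$lookup?{params}"

# e.g. look up a SNOMED CT code by its canonical system URI
url = lookup_url("http://snomed.info/sct", "22298006")
```

The other operations ($validate-code, $expand, $translate) follow the same pattern with different endpoints and parameters.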

Why it matters

Terminology operations are on the critical path of almost every clinical data pipeline — validating codes on intake, expanding value sets for CDS, translating between code systems. A slow or unreliable terminology server degrades the entire stack.

Published vendor benchmarks are rare, methodology is usually opaque, and “works on my machine” comparisons are not reproducible. This benchmark exists to provide a neutral, open baseline.

What is tested

The tests here are selected for their performance interest — text search, large expansions, multiple filters, diverse code system sources — not for coverage breadth. This is not a conformance test; a server may excel at coverage and correctness across the full FHIR terminology API and still rank low here. For broad conformance testing, see the FHIR TX Ecosystem.

Tests cover the main FHIR terminology operations across the most widely used code systems in clinical data:

Category             Operations
LK — Lookup          $lookup for SNOMED CT, LOINC, RxNorm, ICD-10
VC — Validate code   $validate-code against common value sets
EX — Expand          $expand for small, medium, and large value sets
SS — Subsumption     $subsumes for hierarchies
CM — Concept map     $translate using standard cross-maps
FS — FHIR search     GET /CodeSystem, GET /ValueSet with filters

Each test draws inputs randomly from a pool of 2,000+ entries to simulate realistic working-set conditions rather than testing the same code in a hot cache.
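The random-draw behaviour described above can be sketched as follows; the pool contents and size here are illustrative, not the benchmark's actual inputs:

```python
import random

# Illustrative input pool -- the real benchmark draws from a pool of
# 2,000+ entries per test so that repeated requests do not keep hitting
# the same code in a hot cache.
CODE_POOL = [("http://loinc.org", f"{n}-0") for n in range(1000, 3200)]

def next_request(rng: random.Random) -> tuple[str, str]:
    """Pick a (system, code) pair uniformly at random from the pool."""
    return rng.choice(CODE_POOL)

rng = random.Random(42)  # seeded only so this sketch is reproducible
system, code = next_request(rng)
```

Drawing uniformly from a large pool keeps the server's working set realistic: caches still help, but no single entry dominates.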

How to read results

Composite Score — A single 0–100 number summarising overall performance. The top server in each run scores 100; all others are a percentage of that. See the Scoring page for the formula.

wRPS — Weighted requests per second. Raw throughput adjusted so that naturally fast tests (simple lookups) don’t dominate the score over naturally slow ones (large expansions).
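The two steps described above, weighting per-test throughput and then normalising to the best server, can be sketched like this. The weights and the aggregation are illustrative assumptions; the authoritative formula is on the Scoring page:

```python
def weighted_rps(rps_by_test: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-test requests/second into one weighted figure so that
    naturally fast tests (simple lookups) do not swamp naturally slow
    ones (large expansions). Weights here are hypothetical."""
    return sum(weights[test] * rps for test, rps in rps_by_test.items())

def composite_scores(wrps_by_server: dict[str, float]) -> dict[str, float]:
    """Scale so the fastest server scores 100 and every other server is
    a percentage of that, matching the 0-100 composite described above."""
    best = max(wrps_by_server.values())
    return {s: 100.0 * v / best for s, v in wrps_by_server.items()}

weights = {"LK": 0.2, "EX": 0.8}  # hypothetical per-test weights
servers = {
    "server-a": weighted_rps({"LK": 5000.0, "EX": 40.0}, weights),
    "server-b": weighted_rps({"LK": 2000.0, "EX": 90.0}, weights),
}
scores = composite_scores(servers)
```

The normalisation means a composite score only ranks servers within one run; it is not an absolute throughput figure.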

Preflight dots — Each coloured dot represents one test case. Green = passed correctness check and was benchmarked. Red = returned an incorrect response (excluded from scoring). Grey = not supported or returned 4xx/5xx.

Error rate — Fraction of benchmark requests that returned an HTTP error during load testing. Yellow = below 1%. Red = 1% or above.
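The colour rules in the two items above reduce to simple thresholds. A sketch, with the outcome flags as assumed inputs:

```python
def preflight_colour(supported: bool, http_ok: bool, correct: bool) -> str:
    """Map a preflight outcome to a dot colour as described above:
    grey = not supported or HTTP 4xx/5xx, red = incorrect response
    (excluded from scoring), green = correct and benchmarked."""
    if not supported or not http_ok:
        return "grey"
    return "green" if correct else "red"

def error_rate_colour(errors: int, total: int) -> str:
    """Yellow below 1% HTTP errors during load testing, red at 1% or above."""
    return "yellow" if errors / total < 0.01 else "red"
```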

Limitations

  • Not a production proxy. The benchmark runs on loopback with no network, no auth, and a single client machine. Real-world performance depends heavily on network topology, hardware, caching strategies, and data volume.
  • Snapshot in time. Results reflect a specific software version and dataset snapshot. Servers improve; check the run date.
  • Coverage gaps. Not all FHIR operations and code systems are covered. Servers that support operations outside the test suite are not rewarded for that coverage. See the FHIR TX Ecosystem for broad conformance testing.
  • Single node. No clustering or horizontal scaling is tested.