# Health Checks
The health module provides a health check registry with concurrent execution and FastAPI integration for liveness, readiness, and aggregate health endpoints.
- `HealthRegistry`: Register check functions with a FastAPI-style decorator, run them concurrently with per-check timeouts and caching.
- `health_router`: FastAPI router with `/livez`, `/readyz`, and `/healthz` endpoints.
## Health Check

A health check is a function returning `None` (healthy) or a `HealthDetails` dict (also healthy, with metrics attached). Raising an exception signals failure.
```python
from grelmicro.health import HealthDetails, HealthRegistry
from grelmicro.health.errors import HealthError

health = HealthRegistry()


# Decorator form: register an async function under a name
@health.check("database")
async def check_database() -> HealthDetails | None:
    # Return None on success (healthy, no details)
    return None


@health.check("redis")
async def check_redis() -> HealthDetails | None:
    # Return a dict to include details (e.g. metrics)
    return {"latency_ms": 1.2, "version": "7.2"}


@health.check("external-api", critical=False)
async def check_external_api() -> HealthDetails | None:
    # Raise HealthError to expose a specific message in the error field.
    # Other exceptions produce a generic "Health check failed" message.
    msg = "Connection refused"
    raise HealthError(msg)
```
- Return `None`: healthy, no details.
- Return a `HealthDetails` dict: healthy, with details. Values can be primitives, `datetime`, nested dicts, lists, or tuples.
- Raise `HealthError`: unhealthy. The exception message appears in the `error` field.
- Raise any other exception: unhealthy, with a generic `"Health check failed"` message. The traceback is logged server-side to avoid leaking internal information.
`HealthDetails` is a type alias for `dict[str, JSONEncodable]`. Both sync and async check functions are supported. Sync functions run in a worker thread via `anyio.to_thread.run_sync` so they never block the event loop.
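The thread offload can be pictured with a stdlib equivalent: `asyncio.to_thread` below stands in for the `anyio.to_thread.run_sync` call the registry uses, and `check_disk` is a hypothetical blocking check, not part of grelmicro:

```python
import asyncio
import time


def check_disk() -> dict[str, float]:
    # A blocking call: filesystem scan, blocking DB driver, etc.
    time.sleep(0.01)
    return {"free_gb": 42.0}


async def main() -> dict[str, float]:
    # Offloading to a worker thread keeps the event loop responsive,
    # which is what the registry does for sync checks.
    return await asyncio.to_thread(check_disk)


print(asyncio.run(main()))  # {'free_gb': 42.0}
```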
## Registry

Create a `HealthRegistry` and register checks with the `@registry.check(name)` decorator:
```python
from grelmicro.health import HealthDetails, HealthRegistry

# Create the registry (auto-registers as the global singleton)
health = HealthRegistry()


# Register checks with the @health.check(name) decorator
@health.check("database")
async def check_database() -> HealthDetails | None:
    # Return None on success, raise on failure
    return None


@health.check("redis")
async def check_redis() -> HealthDetails | None:
    # Return a dict to include details
    return {"latency_ms": 1.2}


# Optional dependency: mark non-critical so its failure doesn't
# take the instance out of the load balancer.
@health.check("external-api", critical=False)
async def check_external_api() -> HealthDetails | None:
    return None
```
The registry auto-registers as the global singleton. The router resolves it automatically.
For imperative registration (without a decorator), use `registry.add(name, func)`:
```python
from grelmicro.health import HealthDetails, HealthRegistry

health = HealthRegistry()


async def check_kafka() -> HealthDetails | None:
    return None


health.add("kafka", check_kafka, critical=True, timeout=2.0)
```
## Critical vs Non-Critical

By default, all checks are critical: a failure flips the aggregate to `error` and returns 503 on `/readyz` and `/healthz`. Pass `critical=False` for optional dependencies:
```python
from grelmicro.health import HealthDetails, HealthRegistry

health = HealthRegistry()


@health.check("external-api", critical=False)
async def check_external_api() -> HealthDetails | None:
    return None
```
| Scenario | Aggregate status on `/healthz` | `/readyz` | `/healthz` |
|---|---|---|---|
| All critical pass | `ok` | 200 | 200 |
| Non-critical failed | `ok` | 200 | 200 |
| Critical failed | `error` | 503 | 503 |
Non-critical checks do not pull the instance from the load balancer. They are not run on `/readyz` (which runs critical checks only). Their status appears per-check in the `/healthz` body so operators and dashboards can see degraded dependencies without triggering traffic removal.
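The selection rule can be sketched as a small helper (illustrative only, not grelmicro's internals):

```python
def checks_for_endpoint(checks: dict[str, bool], endpoint: str) -> list[str]:
    """checks maps check name -> critical flag.

    Documented selection rule: /livez runs nothing, /readyz runs
    critical checks only, /healthz runs everything.
    """
    if endpoint == "/livez":
        return []
    if endpoint == "/readyz":
        return [name for name, critical in checks.items() if critical]
    return list(checks)


registered = {"database": True, "redis": True, "external-api": False}
print(checks_for_endpoint(registered, "/readyz"))   # ['database', 'redis']
print(checks_for_endpoint(registered, "/healthz"))  # ['database', 'redis', 'external-api']
```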
## Timeout

Checks that exceed their timeout are reported as failed:
```python
from grelmicro.health import HealthDetails, HealthRegistry

# Registry default: 2s per check (default is 5s)
health = HealthRegistry(timeout=2.0)


# Per-check override: tight timeout for a flaky optional dep
@health.check("analytics", critical=False, timeout=0.5)
async def check_analytics() -> HealthDetails | None:
    return None
```
The registry has a global default timeout (5.0 seconds). Per-check overrides are set on registration:
```python
from grelmicro.health import HealthDetails, HealthRegistry

health = HealthRegistry()


@health.check("slow-api", critical=False, timeout=0.5)
async def check_slow_api() -> HealthDetails | None:
    return None
```
A slow non-critical check hits the timeout and is reported with `status: "error"` in the response body, but the aggregate stays `ok` and `/readyz` stays 200.

Timeout detection uses `anyio.move_on_after`, which correctly separates registry timeouts from a `TimeoutError` raised inside the check itself (for example a socket timeout).
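The distinction can be pictured with stdlib asyncio: a cancellation-based deadline stands in for `anyio.move_on_after`, and the two check functions are hypothetical. A sketch, not grelmicro's actual code:

```python
import asyncio
import contextlib


async def run_check(check, timeout: float) -> str:
    # A deadline expiry cancels the still-running task, while a
    # TimeoutError raised *inside* the check (e.g. a socket timeout)
    # surfaces as an ordinary check failure. Cancellation keeps the
    # two cases apart.
    task = asyncio.ensure_future(check())
    done, pending = await asyncio.wait({task}, timeout=timeout)
    if pending:
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
        return "registry timeout"
    exc = task.exception()
    return f"failed: {type(exc).__name__}" if exc else "ok"


async def socket_timeout_check() -> None:
    raise TimeoutError  # raised by the check itself


async def hung_check() -> None:
    await asyncio.sleep(60)


async def main() -> list[str]:
    return [
        await run_check(socket_timeout_check, 0.5),
        await run_check(hung_check, 0.05),
    ]


print(asyncio.run(main()))  # ['failed: TimeoutError', 'registry timeout']
```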
## Caching

The registry caches each check's result for `cache_ttl` seconds (default 1.0) and coalesces concurrent calls via single-flight per check. A given check runs at most once per TTL regardless of how many endpoints or concurrent requests are in flight. This prevents probe traffic from amplifying onto your database.
```python
from grelmicro.health import HealthRegistry

# Default: 1-second TTL with single-flight per check
HealthRegistry(timeout=5.0, cache_ttl=1.0)

# Disable caching entirely
HealthRegistry(cache_ttl=0)
```
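The combination of TTL caching and single-flight can be pictured with this minimal sketch (illustrative, not grelmicro's implementation):

```python
import asyncio
import time


class SingleFlightTTL:
    # Per check: concurrent callers coalesce behind one lock, and a
    # fresh result is reused until the TTL expires.
    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.lock = asyncio.Lock()
        self.value: str | None = None
        self.expires = 0.0
        self.runs = 0

    async def get(self, check) -> str:
        async with self.lock:
            now = time.monotonic()
            if self.value is None or now >= self.expires:
                self.runs += 1
                self.value = await check()
                self.expires = now + self.ttl
            return self.value


async def demo() -> int:
    cache = SingleFlightTTL(ttl=1.0)

    async def check_database() -> str:
        await asyncio.sleep(0.01)  # pretend to hit the database
        return "ok"

    # Ten concurrent probes within the TTL -> one underlying run.
    await asyncio.gather(*(cache.get(check_database) for _ in range(10)))
    return cache.runs


print(asyncio.run(demo()))  # 1
```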
## Returning Details with an Error

Raise `HealthError(message, details=...)` to attach a diagnostic payload to a failing check. The payload appears under `details` on the check entry, subject to `show_details`:
```python
from grelmicro.health import HealthDetails, HealthRegistry
from grelmicro.health.errors import HealthError

health = HealthRegistry()


@health.check("database")
async def check_database() -> HealthDetails | None:
    # Simulate a failure with a diagnostic payload
    msg = "connection pool exhausted"
    raise HealthError(msg, details={"active": 10, "idle": 0, "max": 10})
```
## FastAPI Integration

Add health endpoints to your FastAPI app:
```python
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

from fastapi import FastAPI

from grelmicro.health import HealthDetails, HealthRegistry
from grelmicro.health.fastapi import health_router

health = HealthRegistry()


@health.check("database")
async def check_database() -> HealthDetails | None:
    return None


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncIterator[None]:
    yield


app = FastAPI(lifespan=lifespan)
app.include_router(health_router())
# Endpoints: GET /livez, GET /readyz, GET /healthz
```
This creates three endpoints:
| Endpoint | Purpose | Success | Failure | Body |
|---|---|---|---|---|
| `GET /livez` | Liveness probe. Never runs checks. | 200 | no response (timeout) | empty |
| `GET /readyz` | Readiness probe. Runs critical checks only. | 200 | 503 | empty |
| `GET /healthz` | Aggregate JSON report for humans and dashboards. Runs all checks. | 200 | 503 | JSON `{status, checks}` |
All three also accept `HEAD`. All responses set `Cache-Control: no-store`. Probe endpoints return an empty body; the HTTP status code is the entire signal.

Paths follow the z-pages convention (`/livez`, `/readyz`, `/healthz`). The trailing `z` avoids collisions with application routes like `/health`.
## Using with Docker, Compose, and other Orchestrators

Different orchestrators consume different probes. grelmicro exposes all three endpoints; pick the ones that fit:
| Orchestrator | Uses |
|---|---|
| Kubernetes, OpenShift | `livenessProbe` → `/livez`, `readinessProbe` → `/readyz`, `startupProbe` → `/livez` |
| Docker (`HEALTHCHECK`), Docker Compose, Docker Swarm | single healthcheck → `/livez` (restart on failure) |
| Nomad (`check` stanza) | `/livez` for liveness, `/readyz` for routing |
| systemd (`WatchdogSec`) | `/livez` (restart on failure) |
| Reverse proxies (Traefik, nginx, HAProxy, Envoy), load balancers (AWS ALB, GCP LB) | `/readyz` for upstream health |
| Dashboards, uptime monitors (Prometheus, Pingdom) | `/healthz` for full report |
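For the Kubernetes row, the probe configuration might look like this sketch (port numbers and timings are placeholder values, not grelmicro defaults):

```yaml
livenessProbe:
  httpGet:
    path: /livez
    port: 8000
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: 8000
  periodSeconds: 5
startupProbe:
  httpGet:
    path: /livez
    port: 8000
  periodSeconds: 1
  failureThreshold: 30
```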
Docker Compose example:
```yaml
services:
  app:
    image: myapp
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/livez"]
      interval: 10s
      timeout: 2s
      retries: 3
```
Docker and Docker Swarm only model a single `HEALTHCHECK` per container, so point it at `/livez`. Route readiness through a reverse proxy that probes `/readyz`.
## Exclude Checks

Both `/readyz` and `/healthz` accept an `?exclude` query parameter with a comma-separated list of check names. Useful for temporarily muting a known-flaky check without a redeploy:

```
GET /readyz?exclude=analytics,recommendations
GET /healthz?exclude=analytics
```
Excluded checks are not run and do not appear in the response.
## Verbose Details

Each check can return verbose metadata (versions, internal hostnames, pool stats, latencies). Exposing it publicly on `/healthz` leaks infrastructure fingerprints to anyone who can reach the endpoint, so by default grelmicro strips every `details` field from the response.

The `show_details` parameter controls visibility with three forms:
| `show_details` | Who sees details | Use when |
|---|---|---|
| `False` (default) | nobody | Safe default for a public `/healthz` |
| `True` | everyone | `/healthz` is on a private network only |
| `Depends(fn)` | requests for which `fn()` returns `True` | Public `/healthz`, admin-only details |
With `show_details=Depends(fn)`, `fn` is wired into FastAPI's dependency-injection graph. Return `True` to include details for that request, `False` to strip them. Everything FastAPI supports works: `Depends` sub-dependencies, `Security`, `Request` injection, async functions, `yield` cleanup. Returning `False` strips details, but the endpoint still returns 200/503 with status and check names; uptime monitors without credentials still get actionable aggregate status while admin tools with credentials get the full payload. Raising `HTTPException` blocks the endpoint entirely, so prefer returning `False` for a soft strip.
```python
from ipaddress import ip_address

from fastapi import Depends, Request

from grelmicro.health.fastapi import health_router


def from_private_network(request: Request) -> bool:
    return bool(request.client and ip_address(request.client.host).is_private)


# Hide details from everyone (default).
router = health_router()

# Show details to everyone (private /healthz only).
router = health_router(show_details=True)

# Show details when the dependency returns True.
router = health_router(show_details=Depends(from_private_network))
```
With details enabled, each check entry includes a `details` field:
```json
{
  "status": "ok",
  "checks": {
    "redis": {
      "status": "ok",
      "critical": true,
      "error": null,
      "details": {"latency_ms": 1.2, "version": "7.2"}
    }
  }
}
```
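With details stripped (the default), a plausible shape for the same response keeps status, criticality, and error per check and simply omits the `details` field (sketch inferred from the stripping behavior described above, not captured library output):

```json
{
  "status": "ok",
  "checks": {
    "redis": {
      "status": "ok",
      "critical": true,
      "error": null
    }
  }
}
```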
### `show_details` vs `healthz_dependencies`

Two independent gates sit in front of `/healthz`:
| Parameter | Failure effect | Typical use |
|---|---|---|
| `healthz_dependencies` | Blocks the entire endpoint (401/403) | Keep `/healthz` entirely private |
| `show_details=Depends(fn)` | When `fn()` returns `False`, strips the `details` field; endpoint still returns 200/503 | Keep aggregate status public, hide verbose metadata |
For stricter setups, gate the whole `/healthz` endpoint behind authentication while leaving `/livez` and `/readyz` open (most orchestrators and load balancers cannot carry credentials):
```python
from fastapi import Depends, Request

from grelmicro.health.fastapi import health_router


def require_admin(request: Request) -> None:
    # Replace with your real auth. Raise HTTPException on failure.
    return None


# Gate /healthz behind auth. /livez and /readyz remain open for
# orchestrators and load balancers.
router = health_router(
    show_details=True,
    healthz_dependencies=[Depends(require_admin)],
)
```
## URL Prefix

Mount the health endpoints under a custom prefix:
```python
from fastapi import FastAPI

from grelmicro.health.fastapi import health_router

app = FastAPI()
app.include_router(health_router(prefix="/api/v1"))
# Endpoints: GET /api/v1/livez, GET /api/v1/readyz, GET /api/v1/healthz
```
## Design

### Why Three Endpoints

Each endpoint serves a different audience:
| Endpoint | Audience | Answers |
|---|---|---|
| `/livez` | Orchestrator (Kubernetes, Docker, Nomad, systemd) | Is the process alive? Should it be restarted? |
| `/readyz` | Load balancer, reverse proxy, service mesh | Can this instance serve traffic? |
| `/healthz` | Operators, dashboards, uptime monitors | What is the state of each component? |
Liveness never checks dependencies: a failing database must never restart your container. Readiness runs all critical checks concurrently; if any fails, the instance is removed from the load balancer. Aggregate also runs non-critical checks and returns a JSON report for humans.

Probe bodies stay empty to keep the wire minimal and the signal unambiguous. The HTTP status code is the only thing orchestrators read.
### Status Vocabulary

Binary status, used for both components and the aggregate:
| Status | Meaning |
|---|---|
| `ok` | The check passed. At the aggregate level: every critical check passed. |
| `error` | The check failed. At the aggregate level: at least one critical check failed. |
Non-critical failures produce `status: "error"` on the individual check but do not flip the aggregate. The aggregate only goes to `error` when at least one critical check fails.
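The aggregation rule amounts to this (illustrative sketch, not grelmicro's code):

```python
def aggregate_status(checks: dict[str, tuple[str, bool]]) -> str:
    # checks maps name -> (status, critical). Only a failing *critical*
    # check flips the aggregate to "error".
    failing_critical = any(
        status == "error" and critical
        for status, critical in checks.values()
    )
    return "error" if failing_critical else "ok"


print(aggregate_status({"db": ("ok", True), "analytics": ("error", False)}))  # ok
print(aggregate_status({"db": ("error", True), "analytics": ("ok", False)}))  # error
```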
### Function-Based API

Checks are plain functions. No base class to inherit from.
```python
from grelmicro.health import HealthDetails, HealthRegistry
from grelmicro.health.errors import HealthError

health = HealthRegistry()


# Decorator form: register an async function under a name
@health.check("database")
async def check_database() -> HealthDetails | None:
    # Return None on success (healthy, no details)
    return None


@health.check("redis")
async def check_redis() -> HealthDetails | None:
    # Return a dict to include details (e.g. metrics)
    return {"latency_ms": 1.2, "version": "7.2"}


@health.check("external-api", critical=False)
async def check_external_api() -> HealthDetails | None:
    # Raise HealthError to expose a specific message in the error field.
    # Other exceptions produce a generic "Health check failed" message.
    msg = "Connection refused"
    raise HealthError(msg)
```
This mirrors FastAPI's own `@app.get("/path")` routing style and keeps grelmicro small. Return types are statically checked: `JSONEncodable` (recursive) includes primitives, `datetime`, nested `dict` / `list` / `tuple`, and any `Mapping` subclass. Type checkers (mypy, ty) catch non-serializable returns like `bytes` or custom objects at authoring time.
### Concurrent Execution

All selected checks run in parallel via an anyio task group. A slow check does not block other checks. Each check runs with its own timeout (falling back to the registry default).