Retry

A Retry policy repeats a failing call with backoff and jitter. Use it for transient errors that are safe to retry: network blips, brief overloads, deadline-related failures.

Why

  • Survive flaky network paths without changing the call site.
  • Apply exponential backoff with jitter so retries do not stampede a recovering dependency.
  • Stay safe by default: retries trigger only on the exception classes you allow.

Usage

import httpx

from grelmicro.resilience import retry


@retry(when=httpx.HTTPError, attempts=3)
async def fetch(client: httpx.AsyncClient, url: str) -> bytes:
    response = await client.get(url)
    response.raise_for_status()
    return response.content


async def main() -> bytes:
    async with httpx.AsyncClient() as client:
        return await fetch(client, "https://example.com")

The decorator works on async and sync functions. It auto-detects which kind it wraps.

For inline retries that span multiple statements, use the block form:

import httpx

from grelmicro.resilience import retrying


async def submit(client: httpx.AsyncClient, url: str, payload: dict) -> dict:
    async for attempt in retrying(when=httpx.HTTPError, attempts=3):
        async with attempt:
            response = await client.post(url, json=payload)
            response.raise_for_status()
            return response.json()
    return {}


async def main() -> dict:
    async with httpx.AsyncClient() as client:
        return await submit(client, "https://example.com", {"k": "v"})

when= is required. There is no default. Pass a Match instance, or one of the shorthand forms (an exception class, a tuple of classes, or a predicate callable). See the next section for the full filter DSL.

Filtering outcomes with Match

Match is the DSL every resilience strategy uses to decide whether an outcome (an exception OR a return value) should engage the strategy. The when= parameter on Retry accepts any Match instance, plus the bare-class shorthand for the simple case.

Exception filtering

import httpx

from grelmicro.resilience import Match, Retry

Retry("api", when=Match.exception(httpx.HTTPError))
Retry("api", when=Match.exception(httpx.HTTPError, OSError))
Retry("api", when=Match.exception(lambda e: e.status >= 500))
Retry("api", when=Match.exception_message(contains="timeout"))
Retry("api", when=Match.exception_message(regex=r"^5\d\d "))
Retry("api", when=Match.exception_cause(KeyError))

Result filtering

Retry("polling", when=Match.result(None))
Retry("polling", when=Match.result(False))
Retry("polling", when=Match.result(lambda r: r.status_code >= 500))

Match.result(callable) always treats the argument as a predicate. To match a function literal exactly, wrap with lambda r: r is my_fn.
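The rule can be sketched in a few lines (an illustration of the semantics only; matches_result is a hypothetical name, not part of the library):

```python
def matches_result(spec: object, result: object) -> bool:
    """Sketch of the rule: callable specs are predicates, everything else is equality."""
    if callable(spec):
        return bool(spec(result))
    return result == spec


def my_fn() -> None:
    pass


# Match.result(None)-style: engage when the call returned None.
print(matches_result(None, None))  # True
# Identity form: engage only when the result *is* my_fn itself.
print(matches_result(lambda r: r is my_fn, my_fn))  # True
```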

Composition

# OR
Retry("api", when=Match.exception(httpx.HTTPError) | Match.result(None))

# AND
Retry("api", when=Match.exception(httpx.HTTPError) & Match.exception(lambda e: e.status >= 500))

# NOT (one symmetric `not_*` per primitive)
Retry("api", when=Match.not_exception(ValueError))
Retry("api", when=Match.not_result(None))
Retry("api", when=Match.not_exception_message(contains="ok"))
Retry("api", when=Match.not_exception_cause(KeyError))

Use | for OR and & for AND. Each primitive (exception, result, exception_message, exception_cause) has a not_* twin for the negated form.

Worked example

import httpx

from grelmicro.resilience import Match, retry


# Compose exception and result matchers with `|`.
# Retry on transient HTTP errors OR when the response carries a
# server-side soft-fail marker.
@retry(
    when=Match.exception(httpx.HTTPError)
    | Match.result(lambda r: r.headers.get("X-Soft-Fail") == "true"),
    attempts=5,
)
async def fetch(client: httpx.AsyncClient, url: str) -> httpx.Response:
    return await client.get(url)


# Polling-style: retry until the result is no longer ``None``.
@retry(when=Match.result(None), attempts=20)
async def poll_job(client: httpx.AsyncClient, job_id: str) -> dict | None:
    response = await client.get(f"/jobs/{job_id}")
    payload = response.json()
    return payload if payload["status"] == "ready" else None

What never retries

asyncio.CancelledError, KeyboardInterrupt, and SystemExit are BaseException subclasses outside Exception. They always propagate, regardless of the Match you pass. This is required for correct asyncio shutdown.
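You can verify this against the standard exception hierarchy directly:

```python
import asyncio

# All three sit under BaseException but outside Exception, so an
# Exception-based filter can never capture them.
for exc_type in (asyncio.CancelledError, KeyboardInterrupt, SystemExit):
    print(exc_type.__name__, issubclass(exc_type, Exception))  # False for each
```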

Backoff algorithms

Retry ships five algorithms. The default is exponential with full jitter. Pick by purpose.

  • ExponentialBackoff (default): Network and HTTP retries. Doubling delay with jitter avoids retry storms.
  • ConstantBackoff: Polling-style retries (waiting for a job). Fixed interval is predictable.
  • LinearBackoff: Steady, predictable growth without the early-attempt cluster of exponential.
  • FibonacciBackoff: Smoother growth than exponential, faster than linear.
  • RandomBackoff: Uniform random delay in a fixed range. Maximum spread, no growth.

The factory classmethods build the right config for you:

policy = Retry.exponential("payments", when=httpx.HTTPError, attempts=5)
polling = Retry.constant("wait_job", when=NotReady, attempts=20, delay=1.0)

Exponential

The raw wait before retry N is min(base_delay * 2 ** (N - 1), max_delay). It doubles each attempt until it reaches the cap. With the defaults (base_delay=0.1, max_delay=30.0), the raw wait is 0.1s, 0.2s, 0.4s, 0.8s, 1.6s, ..., capped at 30.0s.
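The formula can be checked with a few lines of plain Python (an illustration of the schedule, not the library's internals):

```python
def raw_wait(attempt: int, base_delay: float = 0.1, max_delay: float = 30.0) -> float:
    """Raw (pre-jitter) wait before retry number `attempt` (1-based)."""
    return min(base_delay * 2 ** (attempt - 1), max_delay)


print([raw_wait(n) for n in range(1, 6)])  # [0.1, 0.2, 0.4, 0.8, 1.6]
print(raw_wait(12))  # capped at 30.0
```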

Jitter then transforms each raw wait into the actual sleep (see Jitter below). The actual sleep may be smaller than the raw value.

Jitter

Jitter is randomness added to the wait so concurrent clients do not retry at the same instant and overwhelm the recovering server.

Pick a mode by how much spread you want. Default: full.

  • none (no spread): Single client. Never with concurrency.
  • full (maximum spread, default): The safe default.
  • equal (half the wait fixed, half random): When you need predictable timing.
  • decorrelated (adaptive spread): High contention on a shared dependency.
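The standard formulations of these modes (following the widely used AWS "Exponential Backoff and Jitter" definitions; grelmicro's exact internals may differ) can be sketched as:

```python
import random


def full_jitter(raw: float) -> float:
    # full: sleep anywhere in [0, raw]; maximum spread.
    return random.uniform(0.0, raw)


def equal_jitter(raw: float) -> float:
    # equal: half the wait is guaranteed, half is random; predictable floor.
    return raw / 2 + random.uniform(0.0, raw / 2)


def decorrelated_jitter(previous: float, base: float, cap: float) -> float:
    # decorrelated: each sleep feeds the next; adapts under contention.
    return min(cap, random.uniform(base, previous * 3))
```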

Constant

A fixed delay between retries. One field: delay (seconds, default 1.0).

Configuration

Retry follows the three-paths configuration contract.

Programmatic

import httpx

from grelmicro.resilience import Retry

policy = Retry.exponential(
    "payments",
    when=httpx.HTTPError,
    attempts=5,
    base_delay=0.2,
    max_delay=10.0,
    jitter="full",
)


@policy
async def call_payments():
    pass

Declarative

import httpx

from grelmicro.resilience import (
    ExponentialBackoff,
    Match,
    Retry,
    RetryConfig,
)

config = RetryConfig(
    attempts=5,
    when=Match.exception(httpx.HTTPError),
    backoff=ExponentialBackoff(base_delay=0.2, max_delay=10.0, jitter="full"),
)
policy = Retry("payments", config=config)

Environmental

Prefix: GREL_RETRY_{NAME_UPPER}_

  • GREL_RETRY_{NAME_UPPER}_ATTEMPTS (attempts): int, >= 1. Default: 3.
  • GREL_RETRY_{NAME_UPPER}_WHEN (when): CSV or JSON list of FQN strings (e.g. httpx.HTTPError), coerced to Match.exception(...). Predicate forms cannot come from env. Required.
  • GREL_RETRY_{NAME_UPPER}_BACKOFF (backoff): JSON object with a kind field (see below). Default: {"kind":"exponential"}.

The full backoff config is a discriminated Pydantic union, so the env value is parsed as one JSON object. Each algorithm accepts the same fields it takes in code:

  • exponential: base_delay, max_delay, jitter (none / full / equal / decorrelated)
  • constant: delay
  • linear: base_delay, max_delay
  • fibonacci: base_delay, max_delay
  • random: min_delay, max_delay

import httpx

from grelmicro.resilience import Retry

# Reads the config from environment variables. The backoff field is
# a discriminated union; pass it as a single JSON object.
#
# - GREL_RETRY_PAYMENTS_ATTEMPTS=5
# - GREL_RETRY_PAYMENTS_WHEN=httpx.HTTPError
# - GREL_RETRY_PAYMENTS_BACKOFF={"kind":"exponential","base_delay":0.2}
policy = Retry("payments", when=httpx.HTTPError)

The callable form of when cannot come from env. Use the FQN list for env-driven configs.
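An FQN string has to be resolved to the class it names before it can feed Match.exception(...). The mechanism can be sketched like this (resolve_fqn is a hypothetical helper for illustration, not the library's API):

```python
import importlib


def resolve_fqn(path: str) -> type:
    """Resolve a dotted path such as "httpx.HTTPError" to the class it names."""
    module_name, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


print(resolve_fqn("builtins.ValueError") is ValueError)  # True
```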

Composition with Circuit Breaker

Retry and Circuit Breaker are designed to compose. When the breaker is OPEN, it raises CircuitBreakerError. Pick a narrow when= allowlist so the retry loop does not swallow that signal:

import httpx

from grelmicro.resilience import CircuitBreaker, retry

cb = CircuitBreaker("payments")


# A narrow allowlist that excludes CircuitBreakerError. When the
# breaker is open it raises CircuitBreakerError, which does not match
# `when`, so the retry loop aborts immediately.
@retry(when=(httpx.ConnectError, httpx.TimeoutException), attempts=3)
async def call_payments(
    client: httpx.AsyncClient, url: str, payload: dict
) -> dict:
    async with cb:
        response = await client.post(url, json=payload)
        response.raise_for_status()
        return response.json()


async def main() -> dict:
    async with httpx.AsyncClient() as client:
        return await call_payments(client, "https://example.com", {"k": "v"})

A broad allowlist (when=Exception) would retry through the open breaker. The narrow allowlist lets the breaker do its job.

Behavior on exhaustion

When attempts is exhausted, the underlying exception is re-raised with a PEP 678 note attached:

try:
    await fetch(url)
except httpx.ConnectError as exc:
    print(exc.__notes__)
    # ['retry: 3/3 attempts exhausted in 1.40s (exponential backoff)']

Callers catch the underlying error type, unchanged. There is no RetryError wrapper class.

Live reconfiguration

Retry inherits Reconfigurable[RetryConfig]. Calling policy.reconfigure(new_config) swaps the snapshot for future loops. An in-flight async for attempt in policy: keeps its snapshot until it completes. See Live reconfiguration.

Reference

See the API reference for every option.