# Guardrails: gates, limits, and violations

How Paygent's guard checks work: three dimensions (period spend, session spend, model tokens), soft vs hard gates, and pre-flight check methods.
Every LLM call inside a paygent_context block passes through a guard check before reaching the provider. The guard runs three independent checks in microseconds, returns the most restrictive violation, and either lets the call through, fires a callback, or raises an exception — all before a single token leaves your system. This page explains how that system works and how to interact with it.
## Three independent checks

The guard always evaluates the same three dimensions:

| # | Check | Compared against | Unit |
|---|---|---|---|
| 1 | Period spend | `max_spend_per_period` | dollars |
| 2 | Session spend | `max_spend_per_session` | dollars |
| 3 | Per-model tokens | `model_limits[model].max_tokens_per_period` | tokens |
Each check is evaluated in its own unit — the SDK never compares dollars to tokens. A user can hit a model token limit while still being well within their dollar cap, and vice versa.

If a limit is not configured (set to `null` or absent from the plan), that check is skipped entirely — Paygent treats a missing limit as infinite.
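For illustration, a plan that sets all three limits might look like the sketch below. The three field names come from the table above; the enclosing structure and the specific values are assumptions.

```python
# Hypothetical plan definition: only the three limit fields are documented
# on this page; the enclosing structure is illustrative.
plan = {
    "max_spend_per_period": 20.0,   # check 1: dollars per billing period
    "max_spend_per_session": 5.0,   # check 2: dollars per session window
    "model_limits": {
        "gpt-4o": {"max_tokens_per_period": 50_000},  # check 3: tokens, per model
    },
    # Omitting a field (or setting it to null) skips that check entirely.
}
```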
## Soft gate vs hard gate

| | Soft gate | Hard gate |
|---|---|---|
| When it fires | At the `soft_gate_at` fraction (default 80%) | At the `hard_gate_at` fraction (default 100%) |
| Call to LLM | Executes normally | Blocked before reaching the provider |
| Tokens consumed | Yes | No |
| Cost incurred | Yes | No |
| Callback fired | `on_soft_gate(result)` | `on_hard_gate(result)` |
| Exception raised | None | `PaygentLimitExceeded` (if `raise_on_hard_gate=True`) |
The hard gate runs before the original `create` call ever executes. A blocked call costs you nothing — no tokens are sent to the provider.
```python
import logging

from paygent import PaygentLimitExceeded

log = logging.getLogger(__name__)

def on_approaching(result):
    log.warning("Soft gate: %s (%.0f%% used)", result.message, result.usage_pct * 100)

def on_blocked(result):
    log.error("Hard gate: %s (reason=%s)", result.message, result.gate_reason)
    # Your own integrations go here — e.g. surface a banner in your UI,
    # send an upgrade-prompt email, or push to your alerting system.

pg.on_soft_gate(on_approaching)
pg.on_hard_gate(on_blocked)

try:
    with paygent_context(user_id="user_123"):
        response = openai.chat.completions.create(...)
except PaygentLimitExceeded as e:
    # Convert to a 429 response in your handler, or show the user
    # an upgrade prompt — see Callbacks & Events for FastAPI / Flask patterns.
    print(f"Blocked: {e.guard_result.message}")
```
## How the guard picks the most restrictive violation

When more than one dimension is in violation simultaneously, the guard returns a single result:

- Hard gate beats soft gate. If period spend is at a soft gate (85%) and model tokens are at a hard gate (102%), the result is `hard_gate`.
- Highest `usage_pct` wins within the same severity. If two dimensions are both at soft gate, the one with the higher percentage is returned.
```python
# Example: user at 95% on period spend AND 102% on gpt-4o tokens.
# Both trip — hard gate wins. usage_pct of the model check is higher.
GuardResult(
    status="hard_gate",
    gate_reason="model_limit:gpt-4o",
    usage_pct=1.02,
    current_value=51000,
    limit_value=50000,
    message="gpt-4o token limit reached: 51,000 of 50,000",
)
```
The `gate_reason` always reflects the single tightest constraint, so your callbacks and error messages are maximally actionable.
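If it helps to see that rule as code, here is a minimal sketch of the selection logic. The statuses and `usage_pct` mirror this page; the helper itself is illustrative, not the SDK's internal implementation.

```python
from dataclasses import dataclass

@dataclass
class Result:
    # Stand-in for the GuardResult fields the rule needs.
    status: str       # "ok" | "soft_gate" | "hard_gate"
    usage_pct: float

SEVERITY = {"ok": 0, "soft_gate": 1, "hard_gate": 2}

def most_restrictive(results):
    # Hard gate beats soft gate; higher usage_pct wins within the same severity.
    return max(results, key=lambda r: (SEVERITY[r.status], r.usage_pct))

# 85% on period spend (soft) vs 102% on model tokens (hard): the hard gate wins.
print(most_restrictive([Result("soft_gate", 0.85), Result("hard_gate", 1.02)]))
```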
## GuardResult fields

| Field | Type | Description |
|---|---|---|
| `status` | `"ok"` \| `"soft_gate"` \| `"hard_gate"` | The guard decision. |
| `gate_reason` | `string` \| `null` | Which limit fired: `"total_spend"`, `"session_spend"`, or `"model_limit:<model>"`. |
| `usage_pct` | `float` | `current_value / limit_value`. Values above 1.0 are possible on hard gates. |
| `current_value` | `float` | Current spend (dollars) or token count at the time of the check. |
| `limit_value` | `float` | The limit being checked. |
| `message` | `string` \| `null` | Human-readable description, e.g. `"gpt-4o token limit reached: 51,000 of 50,000"`. |
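As a small usage illustration, the fields compose directly into log or UI copy. The formatting below is just one option; only the field names come from the table above.

```python
def format_gate(result):
    # Human-readable one-liner built from the documented GuardResult fields.
    return (
        f"{result.status} on {result.gate_reason}: "
        f"{result.current_value:,.0f} of {result.limit_value:,.0f} "
        f"({result.usage_pct:.0%})"
    )

# e.g. "hard_gate on model_limit:gpt-4o: 51,000 of 50,000 (102%)"
```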
### `gate_reason` values

| Value | What it means |
|---|---|
| `"total_spend"` | The period dollar cap was reached or approached. |
| `"session_spend"` | The session dollar cap was reached or approached. |
| `"model_limit:gpt-4o"` | The GPT-4o token limit was reached or approached; the model name is appended after the colon. |
## Session windows

The session spend check uses a clock-based window. Each window lasts `session_timeout_minutes` (default 30 minutes). When the current window expires:

- `session_cost` resets to 0.
- `session_id` rotates to a new UUID.
- `session_started_at` updates to the current time.

Rotation happens automatically during the guard check — you do not need to manage sessions manually. Set `session_timeout_minutes` in your plan to control how long a session window lasts:

```python
"session_timeout_minutes": 30.0  # with a $5 session cap, this means $5 per 30-minute window
```

Each session ID is recorded on every `UsageEvent`, so you can group events by session on the backend to reconstruct conversation-level cost.
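For example, a backend aggregation might look like the sketch below. It assumes the exported usage events expose `session_id` and `cost` fields, per the description above; the export mechanism itself is not documented on this page.

```python
from collections import defaultdict

def cost_by_session(events):
    # Sum per-event cost by session_id to reconstruct conversation-level spend.
    # `events` is assumed to be an iterable of dicts with session_id and cost.
    totals = defaultdict(float)
    for event in events:
        totals[event["session_id"]] += event["cost"]
    return dict(totals)
```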
## Catching PaygentLimitExceeded

`PaygentLimitExceeded` is raised at a hard gate when `raise_on_hard_gate=True` (the default). Catch it wherever you handle LLM call failures:
```python
from paygent import PaygentLimitExceeded

try:
    with paygent_context(user_id="user_123"):
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": query}],
        )
except PaygentLimitExceeded as e:
    guard = e.guard_result
    if guard.gate_reason == "total_spend":
        return "You've reached your monthly limit. Upgrade your plan to continue."
    elif guard.gate_reason and guard.gate_reason.startswith("model_limit:"):
        model = guard.gate_reason.split(":")[1]
        return f"You've used all your {model} tokens for this period."
    return "Usage limit reached."
```
## Soft launch: let hard-gated calls through

Set `raise_on_hard_gate=False` during a soft launch to fire the callback and log the event without stopping the call:

```python
pg = Paygent.init(api_key="pg_live_...", raise_on_hard_gate=False)
```

The `on_hard_gate` callback still fires, the gate event is still recorded in the audit trail, and the LLM call runs anyway. This is useful when you want to measure how often limits would fire before enforcing them in production.
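One lightweight way to do that measurement is to tally would-be blocks inside the callback. This is a sketch; the counter and where you report it are up to you.

```python
from collections import Counter

would_have_blocked = Counter()

def count_blocked(result):
    # Record which limit would have blocked the call, without enforcing it.
    would_have_blocked[result.gate_reason] += 1

pg.on_hard_gate(count_blocked)

# Later, e.g. at the end of the soft-launch window:
# print(would_have_blocked.most_common())
```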
## Pre-flight checks

Check whether the next call is allowed before making it — useful for showing remaining budget in a UI, deciding which model to use, or routing to a cheaper fallback.
### pg.check_guard()

Returns the full `GuardResult` for a user and an optional model:
```python
guard = pg.check_guard("user_123", model="gpt-4o")

if guard.status == "hard_gate":
    print(f"Blocked: {guard.message}")
elif guard.status == "soft_gate":
    print(f"Warning: {guard.message} ({guard.usage_pct:.0%} used)")
else:
    print("OK to proceed")
```
### pg.is_within_limit()

A quick boolean for the common case:
```python
if pg.is_within_limit("user_123", model="gpt-4o"):
    response = openai.chat.completions.create(...)
else:
    return "Your GPT-4o tokens are used up for this period."
```
### pg.get_max_tokens()

Recommends a safe `max_tokens` value for the next call, bounded by whatever limits remain — especially useful for streaming or long-form generation, where you want to avoid cutting off mid-response:
```python
advice = pg.get_max_tokens(
    "user_123",
    model="gpt-4o-mini",
    messages=my_messages,  # Paygent estimates input tokens from this
)

if advice.max_tokens == 0:
    return f"Budget exhausted: {advice.binding_limit}"

response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=my_messages,
    max_tokens=advice.max_tokens,  # never pushes the user past any limit
)
```
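The same advice plugs straight into a streaming call. The snippet below is a sketch: only `max_tokens` comes from Paygent, and the rest is standard OpenAI streaming.

```python
# Streaming variant: the budget-derived cap keeps a limit from cutting the
# stream off mid-response.
stream = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=my_messages,
    max_tokens=advice.max_tokens,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```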
> **Tip:** Pre-flight checks read from the same in-memory cache as the in-call guard. They are fast (no network round-trip) and reflect the most recent metered usage. Use them freely — they add negligible overhead.