# Guardrails: gates, limits, and violations

How Paygent's guard checks work: three dimensions (period spend, session spend, model tokens), soft vs hard gates, and pre-flight check methods.
Every LLM call inside a paygent_context block passes through a guard check before reaching the provider. The guard runs three independent checks in microseconds, returns the most restrictive violation, and either lets the call through, fires a callback, or raises an exception — all before a single token leaves your system. This page explains how that system works and how to interact with it.
## Three independent checks

The guard always evaluates the same three dimensions:

| # | Check | Compared against | Unit |
|---|---|---|---|
| 1 | Period spend | `max_spend_per_period` | dollars |
| 2 | Session spend | `max_spend_per_session` | dollars |
| 3 | Per-model tokens | `model_limits[model].max_tokens_per_period` | tokens |
Each check is evaluated in its own unit — the SDK never compares dollars to tokens. A user can hit a model token limit while still being well within their dollar cap, and vice versa.

If a limit is not configured (set to `null` or absent from the plan), that check is skipped entirely — Paygent treats a missing limit as infinite.
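For illustration, a plan that sets all three limits might look like the sketch below. The three field names come from the table above; the enclosing structure and the specific values are assumptions.

```python
# Hypothetical plan definition: only the three limit fields are documented
# on this page; the enclosing structure is illustrative.
plan = {
    "max_spend_per_period": 20.0,   # check 1: dollars per billing period
    "max_spend_per_session": 5.0,   # check 2: dollars per session window
    "model_limits": {
        "gpt-4o": {"max_tokens_per_period": 50_000},  # check 3: tokens, per model
    },
    # Omitting a field (or setting it to null) skips that check entirely.
}
```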
## Soft gate vs hard gate

| | Soft gate | Hard gate |
|---|---|---|
| When it fires | At the `soft_gate_at` fraction (default 80%) | At the `hard_gate_at` fraction (default 100%) |
| Call to LLM | Executes normally | Blocked before reaching the provider |
| Tokens consumed | Yes | No |
| Cost incurred | Yes | No |
| Callback fired | `on_soft_gate(result)` | `on_hard_gate(result)` |
| Exception raised | None | `PaygentLimitExceeded` (if `raise_on_hard_gate=True`) |
The hard gate runs before the original `create` call ever executes. A blocked call costs you nothing — no tokens are sent to the provider.
```python
import logging

from paygent import PaygentLimitExceeded

log = logging.getLogger(__name__)

def on_approaching(result):
    log.warning("Soft gate: %s (%.0f%% used)", result.message, result.usage_pct * 100)

def on_blocked(result):
    log.error("Hard gate: %s (reason=%s)", result.message, result.gate_reason)
    # Your own integrations go here — e.g. surface a banner in your UI,
    # send an upgrade-prompt email, or push to your alerting system.

pg.on_soft_gate(on_approaching)
pg.on_hard_gate(on_blocked)

try:
    with paygent_context(user_id="user_123"):
        response = openai.chat.completions.create(...)
except PaygentLimitExceeded as e:
    # Convert to a 429 response in your handler, or show the user
    # an upgrade prompt — see Callbacks & Events for FastAPI / Flask patterns.
    print(f"Blocked: {e.guard_result.message}")
```
## How the guard picks the most restrictive violation

When more than one dimension is in violation simultaneously, the guard returns a single result:

- Hard gate beats soft gate. If period spend is at a soft gate (85%) and model tokens are at a hard gate (102%), the result is `hard_gate`.
- Highest `usage_pct` wins within the same severity. If two dimensions are both at soft gate, the one with the higher percentage is returned.
```python
# Example: user at 95% on period spend AND 102% on gpt-4o tokens.
# Both trip — hard gate wins. usage_pct of the model check is higher.
GuardResult(
    status="hard_gate",
    gate_reason="model_limit:gpt-4o",
    usage_pct=1.02,
    current_value=51000,
    limit_value=50000,
    message="gpt-4o token limit reached: 51,000 of 50,000",
)
```
The `gate_reason` always reflects the single tightest constraint, so your callbacks and error messages are maximally actionable.
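If it helps to see that rule as code, here is a minimal sketch of the selection logic. The statuses and `usage_pct` mirror this page; the helper itself is illustrative, not the SDK's internal implementation.

```python
from dataclasses import dataclass

@dataclass
class Result:
    # Stand-in for the GuardResult fields the rule needs.
    status: str       # "ok" | "soft_gate" | "hard_gate"
    usage_pct: float

SEVERITY = {"ok": 0, "soft_gate": 1, "hard_gate": 2}

def most_restrictive(results):
    # Hard gate beats soft gate; higher usage_pct wins within the same severity.
    return max(results, key=lambda r: (SEVERITY[r.status], r.usage_pct))

# 85% on period spend (soft) vs 102% on model tokens (hard): the hard gate wins.
print(most_restrictive([Result("soft_gate", 0.85), Result("hard_gate", 1.02)]))
```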
## GuardResult fields

| Field | Type | Description |
|---|---|---|
| `status` | `"ok"` \| `"soft_gate"` \| `"hard_gate"` | The guard decision. |
| `gate_reason` | `string` \| `null` | Which limit fired: `"total_spend"`, `"session_spend"`, or `"model_limit:<model>"`. |
| `usage_pct` | `float` | `current_value / limit_value`. Values above 1.0 are possible on hard gates. |
| `current_value` | `float` | Current spend (dollars) or token count at the time of the check. |
| `limit_value` | `float` | The limit being checked. |
| `message` | `string` \| `null` | Human-readable description, e.g. `"gpt-4o token limit reached: 51,000 of 50,000"`. |
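As a small usage illustration, the fields compose directly into log or UI copy. The formatting below is just one option; only the field names come from the table above.

```python
def format_gate(result):
    # Human-readable one-liner built from the documented GuardResult fields.
    return (
        f"{result.status} on {result.gate_reason}: "
        f"{result.current_value:,.0f} of {result.limit_value:,.0f} "
        f"({result.usage_pct:.0%})"
    )

# e.g. "hard_gate on model_limit:gpt-4o: 51,000 of 50,000 (102%)"
```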
### `gate_reason` values

| Value | What it means |
|---|---|
| `"total_spend"` | The period dollar cap was reached or approached. |
| `"session_spend"` | The session dollar cap was reached or approached. |
| `"model_limit:gpt-4o"` | The GPT-4o token limit was reached or approached; the model name is appended after the colon. |
## Session windows

The session spend check uses a clock-based window. Each window lasts `session_timeout_minutes` (default 30 minutes). When the current window expires:

- `session_cost` resets to 0.
- `session_id` rotates to a new UUID.
- `session_started_at` updates to the current time.

Rotation happens automatically during the guard check — you do not need to manage sessions manually. Set `session_timeout_minutes` in your plan to control how long a session window lasts:

```python
"session_timeout_minutes": 30.0  # with a $5 session cap, this means $5 per 30-minute window
```

Each session ID is recorded on every `UsageEvent`, so you can group events by session on the backend to reconstruct conversation-level cost.
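For example, a backend aggregation might look like the sketch below. It assumes the exported usage events expose `session_id` and `cost` fields, per the description above; the export mechanism itself is not documented on this page.

```python
from collections import defaultdict

def cost_by_session(events):
    # Sum per-event cost by session_id to reconstruct conversation-level spend.
    # `events` is assumed to be an iterable of dicts with session_id and cost.
    totals = defaultdict(float)
    for event in events:
        totals[event["session_id"]] += event["cost"]
    return dict(totals)
```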
## Catching PaygentLimitExceeded

`PaygentLimitExceeded` is raised at a hard gate when `raise_on_hard_gate=True` (the default). Catch it wherever you handle LLM call failures:
```python
from paygent import PaygentLimitExceeded

try:
    with paygent_context(user_id="user_123"):
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": query}],
        )
except PaygentLimitExceeded as e:
    guard = e.guard_result
    if guard.gate_reason == "total_spend":
        return "You've reached your monthly limit. Upgrade your plan to continue."
    elif guard.gate_reason and guard.gate_reason.startswith("model_limit:"):
        model = guard.gate_reason.split(":")[1]
        return f"You've used all your {model} tokens for this period."
    return "Usage limit reached."
```
## Soft launch: let hard-gated calls through

Set `raise_on_hard_gate=False` during a soft launch to fire the callback and log the event without stopping the call:

```python
pg = Paygent.init(api_key="pg_live_...", raise_on_hard_gate=False)
```

The `on_hard_gate` callback still fires, the gate event is still recorded in the audit trail, and the LLM call runs anyway. This is useful when you want to measure how often limits would fire before enforcing them in production.
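One lightweight way to do that measurement is to tally would-be blocks inside the callback. This is a sketch; the counter and where you report it are up to you.

```python
from collections import Counter

would_have_blocked = Counter()

def count_blocked(result):
    # Record which limit would have blocked the call, without enforcing it.
    would_have_blocked[result.gate_reason] += 1

pg.on_hard_gate(count_blocked)

# Later, e.g. at the end of the soft-launch window:
# print(would_have_blocked.most_common())
```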
## Pre-flight checks

Check whether the next call is allowed before making it — useful for showing remaining budget in a UI, deciding which model to use, or routing to a cheaper fallback.
### pg.check_guard()

Returns the full `GuardResult` for a user and an optional model:
```python
guard = pg.check_guard("user_123", model="gpt-4o")

if guard.status == "hard_gate":
    print(f"Blocked: {guard.message}")
elif guard.status == "soft_gate":
    print(f"Warning: {guard.message} ({guard.usage_pct:.0%} used)")
else:
    print("OK to proceed")
```
### pg.is_within_limit()

A quick boolean for the common case:
```python
if pg.is_within_limit("user_123", model="gpt-4o"):
    response = openai.chat.completions.create(...)
else:
    return "Your GPT-4o tokens are used up for this period."
```
### pg.get_max_tokens()

Recommends a safe `max_tokens` value for the next call, bounded by whatever limits remain — especially useful for streaming or long-form generation, where you want to avoid cutting off mid-response:
```python
advice = pg.get_max_tokens(
    "user_123",
    model="gpt-4o-mini",
    messages=my_messages,  # Paygent estimates input tokens from this
)

if advice.max_tokens == 0:
    return f"Budget exhausted: {advice.binding_limit}"

response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=my_messages,
    max_tokens=advice.max_tokens,  # never pushes the user past any limit
)
```
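The same advice plugs straight into a streaming call. The snippet below is a sketch: only `max_tokens` comes from Paygent, and the rest is standard OpenAI streaming.

```python
# Streaming variant: the budget-derived cap keeps a limit from cutting the
# stream off mid-response.
stream = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=my_messages,
    max_tokens=advice.max_tokens,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```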
> **Tip:** Pre-flight checks read from the same in-memory cache as the in-call guard. They are fast (no network round-trip) and reflect the most recent metered usage. Use them freely — they add negligible overhead.