Callbacks & Events
Paygent fires four types of callbacks:

- `on_soft_gate` — user is approaching a limit
- `on_hard_gate` — user is over a limit (call is blocked)
- `on_usage` — every successful metered call
- `on_session_start` — a session is first loaded for a user

Plus there's the exception you catch in your request handler when a hard gate raises.

This page covers all five with working FastAPI / Flask examples.
on_soft_gate(callback)
Fires when usage hits soft_gate_at (default 80%) on any dimension. The LLM call still runs — soft gate is a warning, not a block.
The callback receives a GuardResult:
```python
class GuardResult:
    status: Literal["ok", "soft_gate", "hard_gate"]  # "soft_gate" here
    gate_reason: str | None   # see table below
    usage_pct: float          # 0.80 → 80% of limit
    current_value: float      # current dollars or tokens
    limit_value: float        # the cap
    message: str | None       # human-readable
```
gate_reason values:
| Value | Meaning |
|---|---|
| `"total_spend"` | Period spend limit |
| `"session_spend"` | Session spend limit |
| `"model_limit:gpt-4o"` | Per-model token cap (model name follows the colon) |
Use it for
- Surfacing a warning banner in your UI
- Logging quota events to your observability stack
- Triggering a "you should upgrade" email
- Pre-emptively switching the user to a cheaper model
Complete example
```python
from openai import OpenAI
from paygent import Paygent, paygent_context

pg = Paygent.init(api_key="pg_live_...")
client = OpenAI()

def warn(result):
    print(f"⚠ {result.message}")
    # gate_reason starts with "model_limit:" for model-specific gates
    if result.gate_reason and result.gate_reason.startswith("model_limit:"):
        model = result.gate_reason.split(":", 1)[1]
        print(f"  → model {model}: {result.current_value:,} / "
              f"{result.limit_value:,} tokens ({result.usage_pct:.0%})")
    else:
        print(f"  → {result.gate_reason}: "
              f"${result.current_value:.2f} / ${result.limit_value:.2f}")

pg.on_soft_gate(warn)

with paygent_context(user_id="user_123"):
    response = client.chat.completions.create(...)
```
Output when the user is at 87%:
```
⚠ Approaching spend limit: 87% used
  → total_spend: $42.63 / $49.00
```
Soft gates are also synced as audit events
Every soft gate fire is recorded as a GateEvent and synced to the backend (rate-limited to one row per (user_id, gate_reason) per 5 seconds — the SDK won't flood the audit trail when a user is hammering the API near the cap). Query them via GET /users/{id}/gate-events. See Backend API → Gate events.
on_hard_gate(callback)
Fires when usage hits hard_gate_at (default 100%). The LLM call is blocked — Paygent does not call OpenAI/Anthropic. No tokens, no cost.
The callback fires before PaygentLimitExceeded is raised. So if you want to log the block, the callback is the right place — even if the developer's code handles the exception, the callback already ran.
If you set raise_on_hard_gate=False on Paygent.init(), the callback fires but no exception is raised — the call proceeds anyway. Useful during a soft launch when you want telemetry on who's exceeding limits without enforcing yet.
Use it for
- Logging blocks to your alerting system (PagerDuty, Sentry)
- Sending a "you've hit your limit, please upgrade" notification
- Recording the block in your own analytics
- Any cleanup before the exception bubbles up
Complete example
```python
import logging

from paygent import Paygent

logger = logging.getLogger(__name__)
pg = Paygent.init(api_key="pg_live_...")

def on_block(result):
    # This runs before the exception is raised.
    logger.error(
        "User hit limit | reason=%s pct=%.2f current=%.2f limit=%.2f",
        result.gate_reason, result.usage_pct,
        result.current_value, result.limit_value,
    )
    # Maybe trigger a workflow
    if result.gate_reason == "total_spend":
        send_upgrade_email(current_user_id())
    elif result.gate_reason and result.gate_reason.startswith("model_limit:"):
        model = result.gate_reason.split(":", 1)[1]
        log_to_analytics("model_quota_hit", model=model)

pg.on_hard_gate(on_block)
```
on_usage(callback)
Fires after every successful metered call. The callback receives a UsageEvent:
```python
class UsageEvent:
    id: str                  # UUID, idempotency key
    user_id: str
    session_id: str | None
    timestamp: datetime
    model: str | None        # normalized (e.g. "gpt-4o-mini", not "...-2024-07-18")
    input_tokens: int
    output_tokens: int
    total_tokens: int
    tool_calls: list[str]    # tool names invoked
    cost_tokens: float       # cost from token usage
    cost_tools: float        # cost from tool calls
    cost_total: float        # sum of the two above
    metadata: dict           # whatever you put in paygent_context(metadata=...)
    synced: bool             # has this been pushed to the backend yet
```
Use it for
- Real-time dashboard updates
- Forwarding to your own analytics (Segment, Mixpanel, custom)
- Cost-tracking in your own DB alongside Paygent
- Custom alerting (e.g. "this user used $5 in 5 minutes")
Complete example
```python
pg = Paygent.init(api_key="pg_live_...")

def log_call(event):
    print(
        f"[{event.timestamp.isoformat()}] "
        f"{event.user_id} {event.model} "
        f"in={event.input_tokens} out={event.output_tokens} "
        f"${event.cost_total:.4f}"
    )
    # Forward to your own pipeline
    my_segment_client.track(
        event.user_id,
        "llm_call",
        properties={
            "model": event.model,
            "tokens": event.total_tokens,
            "cost": event.cost_total,
            **event.metadata,
        },
    )

pg.on_usage(log_call)
```
The callback fires for every metering path: auto-instrumented OpenAI calls, pg.wrap(), pg.awrap(), and the LangChain / CrewAI callbacks. So it's the single place to plug in your own analytics.
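For instance, the "$5 in 5 minutes" style of custom alert mentioned above can be built on this stream with a small sliding window. A sketch — the window size, threshold, and `record_spend` helper are all illustrative, not part of the SDK:

```python
from collections import deque

WINDOW_SECONDS = 300        # 5 minutes
THRESHOLD_DOLLARS = 5.0
_spend: dict[str, deque] = {}

def record_spend(user_id: str, ts: float, cost: float) -> bool:
    """Record one call's cost; True means the window total crossed the threshold."""
    q = _spend.setdefault(user_id, deque())
    q.append((ts, cost))
    # Drop entries that have aged out of the window
    while q and ts - q[0][0] > WINDOW_SECONDS:
        q.popleft()
    return sum(c for _, c in q) >= THRESHOLD_DOLLARS
```

Inside your `on_usage` callback you'd call `record_spend(event.user_id, event.timestamp.timestamp(), event.cost_total)` and fire an alert when it returns True.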
on_session_start(callback)
Fires when a session is first loaded for a user — typically on their first call. Callback receives the UserState:
```python
def on_start(state):
    print(f"Session start: user={state.user_id} plan={state.plan}")
    if state.billing_period:
        print(f"  Period: {state.billing_period.start} → {state.billing_period.end}")

pg.on_session_start(on_start)
```
Useful for debugging which plan a user landed on, or for warming up your own per-user caches.
Multiple callbacks
You can register more than one of each type. They all fire in registration order. If one of them throws, the others still fire — Paygent catches exceptions per-callback.
```python
pg.on_soft_gate(send_warning_email)
pg.on_soft_gate(log_to_metrics)
pg.on_soft_gate(update_dashboard)
# All three fire on every soft gate, even if send_warning_email throws.
```
There's no unregister. The list is meant to be set up once at startup.
Catching PaygentLimitExceeded
When raise_on_hard_gate=True (the default) and a hard gate fires, the LLM call doesn't run and PaygentLimitExceeded is raised. Catch it in your request handler.
The exception carries the full GuardResult:
```python
try:
    response = client.chat.completions.create(...)
except PaygentLimitExceeded as e:
    e.guard_result.gate_reason  # "total_spend", "model_limit:gpt-4o", etc.
    e.guard_result.message      # human-readable
    e.guard_result.usage_pct    # how far over (1.02 = 102%)
```
FastAPI
```python
import os

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import OpenAI
from paygent import Paygent, PaygentLimitExceeded, paygent_context

pg = Paygent.init(api_key=os.environ["PAYGENT_API_KEY"])
client = OpenAI()
app = FastAPI()

class ChatRequest(BaseModel):
    user_id: str
    message: str

@app.post("/chat")
async def chat(req: ChatRequest):
    try:
        with paygent_context(user_id=req.user_id):
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": req.message}],
            )
        return {"reply": response.choices[0].message.content}
    except PaygentLimitExceeded as e:
        raise HTTPException(
            status_code=429,
            detail={
                "error": "limit_reached",
                "reason": e.guard_result.gate_reason,
                "message": e.guard_result.message,
                "usage_pct": e.guard_result.usage_pct,
            },
        )
```
429 is the conventional status — "Too Many Requests" / "rate-limited" in spirit, even though the cap is dollar-based.
Flask
```python
import os

from flask import Flask, request, jsonify
from openai import OpenAI
from paygent import Paygent, PaygentLimitExceeded, paygent_context

pg = Paygent.init(api_key=os.environ["PAYGENT_API_KEY"])
client = OpenAI()
app = Flask(__name__)

@app.post("/chat")
def chat():
    body = request.get_json()
    user_id = body["user_id"]
    try:
        with paygent_context(user_id=user_id):
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": body["message"]}],
            )
        return jsonify({"reply": response.choices[0].message.content})
    except PaygentLimitExceeded as e:
        return jsonify({
            "error": "limit_reached",
            "reason": e.guard_result.gate_reason,
            "message": e.guard_result.message,
        }), 429
```
Don't want to handle exceptions?
Pass raise_on_hard_gate=False to Paygent.init(). Then:
- `on_hard_gate` callbacks still fire
- The LLM call executes anyway
- Your code looks identical to a successful call
- The user's `period_cost` ticks up past the cap
This mode is useful during a rollout: get telemetry on who would've been blocked without actually blocking yet. Once you're confident in your limits, flip raise_on_hard_gate=True.
Pre-flight checks (no exception, just ask)
If you'd rather check the gate before making the call:
```python
if pg.is_within_limit("user_123", model="gpt-4o"):
    response = client.chat.completions.create(model="gpt-4o", ...)
else:
    return "You've hit your GPT-4o quota. Upgrade or use gpt-4o-mini."
```
Or for the full picture:
```python
budget = pg.get_remaining_budget("user_123")
print(f"Period: ${budget.period_spend_remaining:.2f} left")
print(f"Session: ${budget.session_spend_remaining:.2f} left")
print(f"By model: {budget.model_tokens_remaining}")
print(f"Tightest: {budget.most_constrained}")
```
For computing a safe max_tokens value to pass to OpenAI:
```python
advice = pg.get_max_tokens(
    "user_123",
    model="gpt-4o",
    messages=[{"role": "user", "content": "..."}],  # paygent estimates input tokens
)
if advice.max_tokens == 0:
    return "Budget exhausted"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=...,
    max_tokens=advice.max_tokens,  # guaranteed not to overshoot any limit
)
```
advice.binding_limit tells you which dimension is tightest (period_spend, session_spend, model_tokens, unbounded, or blocked).
Next steps
- Frameworks — LangChain / CrewAI integration with the same callback model
- Streaming — when token capture happens for streamed responses
- SDK Reference — every callback signature