
Callbacks & Events

Paygent fires four types of callbacks:

  • on_soft_gate — user is approaching a limit
  • on_hard_gate — user is over a limit (the call is blocked)
  • on_usage — every successful metered call
  • on_session_start — a user's session is first loaded

Plus there's the exception you catch in your request handler when a hard gate raises.

This page covers all of them, with working FastAPI / Flask examples.

on_soft_gate(callback)

Fires when usage hits soft_gate_at (default 80%) on any dimension. The LLM call still runs — soft gate is a warning, not a block.

The callback receives a GuardResult:

class GuardResult:
    status: Literal["ok", "soft_gate", "hard_gate"]   # "soft_gate" here
    gate_reason: str | None                            # see table below
    usage_pct: float                                   # 0.80 → 80% of limit
    current_value: float                               # current dollars or tokens
    limit_value: float                                 # the cap
    message: str | None                                # human-readable

gate_reason values:

Value                  Meaning
"total_spend"          Period spend limit
"session_spend"        Session spend limit
"model_limit:gpt-4o"   Per-model token cap (model name follows the colon)

Use it for

  • Surfacing a warning banner in your UI
  • Logging quota events to your observability stack
  • Triggering a "you should upgrade" email
  • Pre-emptively switching the user to a cheaper model

Complete example

from openai import OpenAI

from paygent import Paygent, paygent_context

pg = Paygent.init(api_key="pg_live_...")
client = OpenAI()  # auto-instrumented by Paygent

def warn(result):
    print(f"⚠ {result.message}")
    # gate_reason starts with "model_limit:" for model-specific gates
    if result.gate_reason and result.gate_reason.startswith("model_limit:"):
        model = result.gate_reason.split(":", 1)[1]
        print(f"   → model {model}: {result.current_value:,.0f} / "
              f"{result.limit_value:,.0f} tokens ({result.usage_pct:.0%})")
    else:
        print(f"   → {result.gate_reason}: "
              f"${result.current_value:.2f} / ${result.limit_value:.2f}")

pg.on_soft_gate(warn)

with paygent_context(user_id="user_123"):
    response = client.chat.completions.create(...)

Output when the user is at 87%:

⚠ Approaching spend limit: 87% used
   → total_spend: $42.63 / $49.00

Soft gates are also synced as audit events

Every soft gate fire is recorded as a GateEvent and synced to the backend (rate-limited to one row per (user_id, gate_reason) per 5 seconds — the SDK won't flood the audit trail when a user is hammering the API near the cap). Query them via GET /users/{id}/gate-events. See Backend API → Gate events.

on_hard_gate(callback)

Fires when usage hits hard_gate_at (default 100%). The LLM call is blocked — Paygent does not call OpenAI/Anthropic. No tokens, no cost.

The callback fires before PaygentLimitExceeded is raised. If you want to log the block, the callback is the right place: it runs whether or not your code goes on to handle the exception.

If you set raise_on_hard_gate=False on Paygent.init(), the callback fires but no exception is raised — the call proceeds anyway. Useful during a soft launch when you want telemetry on who's exceeding limits without enforcing yet.

Use it for

  • Logging blocks to your alerting system (PagerDuty, Sentry)
  • Sending a "you've hit your limit, please upgrade" notification
  • Recording the block in your own analytics
  • Any cleanup before the exception bubbles up

Complete example

import logging
from paygent import Paygent

logger = logging.getLogger(__name__)
pg = Paygent.init(api_key="pg_live_...")

def on_block(result):
    # This runs before the exception is raised.
    logger.error(
        "User hit limit | reason=%s pct=%.2f current=%.2f limit=%.2f",
        result.gate_reason, result.usage_pct,
        result.current_value, result.limit_value,
    )

    # Maybe trigger a workflow (send_upgrade_email, current_user_id, and
    # log_to_analytics are your own helpers)
    if result.gate_reason == "total_spend":
        send_upgrade_email(current_user_id())
    elif result.gate_reason and result.gate_reason.startswith("model_limit:"):
        model = result.gate_reason.split(":", 1)[1]
        log_to_analytics("model_quota_hit", model=model)

pg.on_hard_gate(on_block)

on_usage(callback)

Fires after every successful metered call. The callback receives a UsageEvent:

class UsageEvent:
    id: str                    # UUID, idempotency key
    user_id: str
    session_id: str | None
    timestamp: datetime
    model: str | None          # normalized (e.g. "gpt-4o-mini", not "...-2024-07-18")
    input_tokens: int
    output_tokens: int
    total_tokens: int
    tool_calls: list[str]      # tool names invoked
    cost_tokens: float         # cost from token usage
    cost_tools: float          # cost from tool calls
    cost_total: float          # sum of above
    metadata: dict             # whatever you put in paygent_context(metadata=...)
    synced: bool               # has this been pushed to backend yet
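
The `id` field is an idempotency key: if you mirror usage events into your own store, deduplicate on it so a replayed callback never double-counts. A sketch using SQLite — the table and column names are assumptions, not part of the SDK:

```python
import sqlite3

def make_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS usage_events ("
        "  id TEXT PRIMARY KEY,"  # UsageEvent.id — dedupe happens here
        "  user_id TEXT,"
        "  cost_total REAL"
        ")"
    )
    return conn

def record(conn: sqlite3.Connection, event) -> None:
    # INSERT OR IGNORE: a replayed event with the same id is a no-op
    conn.execute(
        "INSERT OR IGNORE INTO usage_events (id, user_id, cost_total) VALUES (?, ?, ?)",
        (event.id, event.user_id, event.cost_total),
    )
    conn.commit()
```

Register it with `pg.on_usage(lambda e: record(conn, e))` at startup.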

Use it for

  • Real-time dashboard updates
  • Forwarding to your own analytics (Segment, Mixpanel, custom)
  • Cost-tracking in your own DB alongside Paygent
  • Custom alerting (e.g. "this user used $5 in 5 minutes")

Complete example

from paygent import Paygent

pg = Paygent.init(api_key="pg_live_...")

def log_call(event):
    print(
        f"[{event.timestamp.isoformat()}] "
        f"{event.user_id} {event.model} "
        f"in={event.input_tokens} out={event.output_tokens} "
        f"${event.cost_total:.4f}"
    )

    # Forward to your own pipeline
    my_segment_client.track(
        event.user_id,
        "llm_call",
        properties={
            "model": event.model,
            "tokens": event.total_tokens,
            "cost": event.cost_total,
            **event.metadata,
        },
    )

pg.on_usage(log_call)

The callback fires for every metering path: auto-instrumented OpenAI calls, pg.wrap(), pg.awrap(), and the LangChain / CrewAI callbacks. So it's the single place to plug in your own analytics.

on_session_start(callback)

Fires when a session is first loaded for a user — typically on their first call. Callback receives the UserState:

def on_start(state):
    print(f"Session start: user={state.user_id} plan={state.plan}")
    if state.billing_period:
        print(f"  Period: {state.billing_period.start} → {state.billing_period.end}")

pg.on_session_start(on_start)

Useful for debugging which plan a user landed on, or for warming up your own per-user caches.

Multiple callbacks

You can register more than one of each type. They all fire in registration order. If one of them throws, the others still fire — Paygent catches exceptions per-callback.

pg.on_soft_gate(send_warning_email)
pg.on_soft_gate(log_to_metrics)
pg.on_soft_gate(update_dashboard)

# All three fire on every soft gate, even if send_warning_email throws.

There's no unregister. The list is meant to be set up once at startup.

Catching PaygentLimitExceeded

When raise_on_hard_gate=True (the default) and a hard gate fires, the LLM call doesn't run and PaygentLimitExceeded is raised. Catch it in your request handler.

The exception carries the full GuardResult:

try:
    response = client.chat.completions.create(...)
except PaygentLimitExceeded as e:
    e.guard_result.gate_reason   # "total_spend", "model_limit:gpt-4o", etc.
    e.guard_result.message       # human-readable
    e.guard_result.usage_pct     # how far over (1.02 = 102%)

FastAPI

import os

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import OpenAI

from paygent import Paygent, PaygentLimitExceeded, paygent_context

pg = Paygent.init(api_key=os.environ["PAYGENT_API_KEY"])
client = OpenAI()
app = FastAPI()

class ChatRequest(BaseModel):
    user_id: str
    message: str

@app.post("/chat")
async def chat(req: ChatRequest):
    try:
        with paygent_context(user_id=req.user_id):
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": req.message}],
            )
        return {"reply": response.choices[0].message.content}

    except PaygentLimitExceeded as e:
        raise HTTPException(
            status_code=429,
            detail={
                "error": "limit_reached",
                "reason": e.guard_result.gate_reason,
                "message": e.guard_result.message,
                "usage_pct": e.guard_result.usage_pct,
            },
        )

429 is the conventional status — "Too Many Requests" / "rate-limited" in spirit, even though the cap is dollar-based.

Flask

import os

from flask import Flask, request, jsonify
from openai import OpenAI

from paygent import Paygent, PaygentLimitExceeded, paygent_context

pg = Paygent.init(api_key=os.environ["PAYGENT_API_KEY"])
client = OpenAI()
app = Flask(__name__)

@app.post("/chat")
def chat():
    body = request.get_json()
    user_id = body["user_id"]

    try:
        with paygent_context(user_id=user_id):
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": body["message"]}],
            )
        return jsonify({"reply": response.choices[0].message.content})

    except PaygentLimitExceeded as e:
        return jsonify({
            "error": "limit_reached",
            "reason": e.guard_result.gate_reason,
            "message": e.guard_result.message,
        }), 429

Don't want to handle exceptions?

Pass raise_on_hard_gate=False to Paygent.init(). Then:

  • on_hard_gate callbacks still fire
  • The LLM call executes anyway
  • Your code looks identical to a successful call
  • The user's period_cost ticks up past the cap

This mode is useful during a rollout: get telemetry on who would've been blocked without actually blocking yet. Once you're confident in your limits, flip raise_on_hard_gate=True.
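
In that shadow mode, an on_hard_gate callback is all you need to count would-be blocks. A minimal sketch — the Counter and how you report it are illustrative:

```python
from collections import Counter

would_be_blocks: Counter = Counter()

def record_would_be_block(result) -> None:
    # With raise_on_hard_gate=False this fires but nothing is actually blocked.
    would_be_blocks[result.gate_reason] += 1

# pg = Paygent.init(api_key="pg_live_...", raise_on_hard_gate=False)
# pg.on_hard_gate(record_would_be_block)
```

Periodically dump `would_be_blocks` to your metrics system to see which limits would bite hardest.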

Pre-flight checks (no exception, just ask)

If you'd rather check the gate before making the call:

if pg.is_within_limit("user_123", model="gpt-4o"):
    response = client.chat.completions.create(model="gpt-4o", ...)
else:
    return "You've hit your GPT-4o quota. Upgrade or use gpt-4o-mini."

Or for the full picture:

budget = pg.get_remaining_budget("user_123")
print(f"Period:     ${budget.period_spend_remaining:.2f} left")
print(f"Session:    ${budget.session_spend_remaining:.2f} left")
print(f"By model:   {budget.model_tokens_remaining}")
print(f"Tightest:   {budget.most_constrained}")

For computing a safe max_tokens value to pass to OpenAI:

advice = pg.get_max_tokens(
    "user_123",
    model="gpt-4o",
    messages=[{"role": "user", "content": "..."}],   # paygent estimates input tokens
)
if advice.max_tokens == 0:
    return "Budget exhausted"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=...,
    max_tokens=advice.max_tokens,   # guaranteed not to overshoot any limit
)

advice.binding_limit tells you which dimension is tightest (period_spend, session_spend, model_tokens, unbounded, or blocked).
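
One way to turn binding_limit into a user-facing message. The set of values comes from the list above; the helper and its wording are an assumption, not SDK API:

```python
def describe_binding_limit(binding_limit: str) -> str:
    """Map advice.binding_limit to a message you might show the user."""
    messages = {
        "period_spend": "You're close to your spend limit for this billing period.",
        "session_spend": "This session is close to its spend limit.",
        "model_tokens": "You're close to this model's token cap.",
        "blocked": "You've hit a limit; this call would be blocked.",
        "unbounded": "No limit is close to binding.",
    }
    return messages.get(binding_limit, f"Unknown limit: {binding_limit}")
```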

Next steps

  • Frameworks — LangChain / CrewAI integration with the same callback model
  • Streaming — when token capture happens for streamed responses
  • SDK Reference — every callback signature