Webhook monitoring and alerting: metrics, SLOs, incident response
Webhooks are production infrastructure. If you can’t measure backlog, retries, and processing latency, you’re flying blind.
This guide shows what to measure, what to alert on, and how to respond during incidents.
TL;DR
- Define SLOs: ingest success rate, end-to-end latency, backlog age, and DLQ rate.
- Track the right metrics: 2xx/4xx/5xx, signature failures, retries by reason, time-to-ack.
- Alert on sustained failures (not single spikes) and on backlog growth/backlog age.
- Treat spikes in signature failures as incidents (misconfig or attack).
- Have a “stop-the-bleed” playbook: pause processing, quarantine poison messages, replay safely after fixes.
- Measure duplicates and dedupe hit rate; it’s the only way to know if idempotency works.
If you are debugging delivery failures, use the debugging playbook.
Anti-patterns
- Only tracking total webhook count (you need latency, backlog, and failure reasons).
- Alerting on single failures instead of sustained conditions and rates.
- No replay/validation path after incidents (you can’t prove recovery).
Reliability patterns that reduce incidents are covered in retries & backoff.
Core concepts
Monitor webhooks like a queueing system: backlog is risk, latency is customer impact, and retries are signal.
Backlog age > backlog depth
Depth can be high during bursts. Age tells you how stale events are and whether you are violating expectations.
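As a sketch, backlog age can be derived from per-message enqueue timestamps. The `enqueued_at` field and snapshot shape here are assumptions; use whatever metadata your queue actually exposes:

```python
import time

def backlog_metrics(messages):
    """Compute depth and oldest-message age (seconds) for a queue snapshot.

    Assumes each message dict carries an `enqueued_at` Unix timestamp
    (an illustrative field name, not a real API).
    """
    now = time.time()
    depth = len(messages)
    oldest_age_s = max((now - m["enqueued_at"] for m in messages), default=0.0)
    return {"depth": depth, "oldest_age_s": oldest_age_s}

# A burst of two messages: depth says little, but the 90s-old message is the risk
snapshot = [{"enqueued_at": time.time() - 90}, {"enqueued_at": time.time() - 5}]
m = backlog_metrics(snapshot)
```

Alert on `oldest_age_s` crossing a threshold, not on `depth` alone.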
Time-to-ack
Time-to-ack is the end-to-end latency from ingest to completed side effect. Track p95/p99 and alert on sustained regressions.
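A minimal way to track those percentiles is nearest-rank over a window of recorded latencies. The sample values here are illustrative; a real worker would feed its measured time-to-ack into a proper histogram:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over a window of latency samples (ms)."""
    if not samples:
        return 0.0
    ordered = sorted(samples)
    k = min(len(ordered) - 1, max(0, math.ceil(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Illustrative time-to-ack samples; with only 10 points, p95 lands on the worst one
latencies_ms = [12, 15, 18, 22, 30, 45, 60, 120, 400, 950]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

In production, prefer a streaming histogram (e.g. fixed latency buckets) over sorting raw samples.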
Retry reasons
Retries reveal dependency failures. Track failure classes (timeouts, 5xx, validation) and prioritize fixes.
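One sketch of tracking retry reasons is to map exceptions to coarse failure classes and count them. The class names and the `status` attribute here are assumptions; adapt them to your own error types:

```python
from collections import Counter

def classify_failure(exc):
    """Map an exception to a coarse failure class for retry metrics.

    The taxonomy (timeout / upstream_5xx / validation / unknown) is
    illustrative; use whatever classes your dashboards need.
    """
    if isinstance(exc, TimeoutError):
        return "timeout"
    status = getattr(exc, "status", None)  # assumed attribute on HTTP errors
    if isinstance(status, int) and status >= 500:
        return "upstream_5xx"
    if isinstance(exc, ValueError):
        return "validation"
    return "unknown"

retry_reasons = Counter()
for exc in [TimeoutError(), ValueError("bad schema"), RuntimeError("boom")]:
    retry_reasons[classify_failure(exc)] += 1
```

A rising `timeout` or `upstream_5xx` count points at a dependency; rising `validation` points at schema drift.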
Simple flow
Ingest (accepted: 2xx rate) → Queue (backlog: age + depth) → Worker (outcomes: Ack/Nack/DLQ)
If you don’t have “backlog age” and “time-to-ack,” you will miss the most important incidents until customers complain.
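Alerting on sustained conditions rather than single spikes can be sketched as a sliding window that fires only after N consecutive breached checks. Most alerting systems express this natively (e.g. as a hold duration), so treat this as illustration only:

```python
from collections import deque

class SustainedAlert:
    """Fire only when a condition has held for `window` consecutive checks."""

    def __init__(self, window):
        self.recent = deque(maxlen=window)

    def observe(self, breached: bool) -> bool:
        self.recent.append(breached)
        # Fire only once the window is full and every check breached
        return len(self.recent) == self.recent.maxlen and all(self.recent)

alert = SustainedAlert(window=3)
# A single spike (the isolated False resets nothing, but breaks the streak)
results = [alert.observe(b) for b in [True, True, False, True, True, True]]
# Fires only on the final check, after three consecutive breaches
```

The same pattern applies to non-2xx rate, backlog age, and DLQ volume thresholds.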
Monitoring checklist
Use this to define dashboards, alerts, and incident playbooks for your webhook system.
- [ ] SLOs defined: ingest success, end-to-end latency, backlog age, DLQ rate
- [ ] Ingest metrics: request rate, 2xx/4xx/5xx, timeouts, signature failures
- [ ] Processing metrics: time-to-ack, retry attempts, retry reasons, dedupe hit rate
- [ ] Backlog metrics: queue depth, oldest message age, consumer lag
- [ ] DLQ metrics: DLQ volume, top error reasons, time in DLQ
- [ ] Alert rules:
  - [ ] sustained non-2xx above baseline
  - [ ] backlog age above threshold
  - [ ] spike in signature failures
  - [ ] DLQ above baseline
- [ ] Incident playbook:
  - [ ] stop-the-bleed action (pause/scale down side effects)
  - [ ] quarantine poison messages
  - [ ] safe replay strategy after fix
  - [ ] postmortem checklist (root cause + prevention)
Reference implementation
A minimal approach to observability: structured logs + latency measurements around processing and acknowledgements.
Node
Structured logs + time-to-ack
// Worker instrumentation example: structured logs + latency measurement around processing and Ack/Nack
// Requires Node 18+ for the built-in fetch.
const QUEUE_NEXT_URL =
  process.env.HOOQUE_QUEUE_NEXT_URL ??
  "https://app.hooque.io/queues/cons_webhook_events/next";
const TOKEN = process.env.HOOQUE_TOKEN ?? "hq_tok_replace_me";
const headers = { Authorization: `Bearer ${TOKEN}` };

function log(obj) {
  console.log(JSON.stringify({ ts: new Date().toISOString(), ...obj }));
}

async function processPayload(payload) {
  // TODO: your real work here
}

async function main() {
  while (true) {
    const start = Date.now();
    const resp = await fetch(QUEUE_NEXT_URL, { headers });
    if (resp.status === 204) {
      log({ level: "info", msg: "queue empty" });
      break;
    }
    if (!resp.ok) throw new Error(`Hooque next() failed: ${resp.status}`);
    const payload = await resp.json();
    const meta = JSON.parse(resp.headers.get("X-Hooque-Meta") ?? "{}");
    try {
      await processPayload(payload);
      await fetch(meta.ackUrl, { method: "POST", headers });
      log({ level: "info", msg: "acked", processing_ms: Date.now() - start });
    } catch (err) {
      await fetch(meta.nackUrl, {
        method: "POST",
        headers: { ...headers, "Content-Type": "application/json" },
        body: JSON.stringify({ reason: String(err) }),
      });
      log({ level: "error", msg: "nacked", err: String(err), processing_ms: Date.now() - start });
    }
  }
}

main().catch((err) => {
  log({ level: "error", msg: "worker crashed", err: String(err) });
  process.exit(1);
});
Python
Structured logs + time-to-ack
# Worker instrumentation example: structured logs + latency measurement around processing and Ack/Nack
import json
import os
import time

import requests

QUEUE_NEXT_URL = os.getenv(
    "HOOQUE_QUEUE_NEXT_URL",
    "https://app.hooque.io/queues/cons_webhook_events/next",
)
TOKEN = os.getenv("HOOQUE_TOKEN", "hq_tok_replace_me")
headers = {"Authorization": f"Bearer {TOKEN}"}


def log(obj: dict) -> None:
    obj["ts"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    print(json.dumps(obj))


def process_payload(payload: dict) -> None:
    # TODO: your real work here
    return None


while True:
    start = time.time()
    resp = requests.get(QUEUE_NEXT_URL, headers=headers, timeout=30)
    if resp.status_code == 204:
        log({"level": "info", "msg": "queue empty"})
        break
    if resp.status_code >= 400:
        raise RuntimeError(f"Hooque next() failed: {resp.status_code} {resp.text}")
    payload = resp.json()
    meta = json.loads(resp.headers.get("X-Hooque-Meta", "{}"))
    try:
        process_payload(payload)
        requests.post(meta["ackUrl"], headers=headers, timeout=30)
        log({"level": "info", "msg": "acked", "processing_ms": int((time.time() - start) * 1000)})
    except Exception as err:
        requests.post(
            meta["nackUrl"],
            headers={**headers, "Content-Type": "application/json"},
            json={"reason": str(err)},
            timeout=30,
        )
        log({"level": "error", "msg": "nacked", "err": str(err), "processing_ms": int((time.time() - start) * 1000)})
Common failure modes
Alerts should point to actions. Each failure mode should have a “what to do next” runbook.
Backlog age increasing
Likely causes
- Downstream dependency degraded.
- Retry storms amplify load.
- Worker concurrency too high or too low.
Next checks
- Check dependency error rates and latency.
- Add backoff + jitter and cap retries.
- Adjust worker concurrency and isolate hot tenants.
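"Backoff + jitter" from the checks above can be sketched as full-jitter exponential backoff. The base, cap, and attempt limit here are illustrative values, not recommendations:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: a random delay in
    [0, min(cap, base * 2**attempt)]. The caller should also cap the
    number of attempts so permanent failures reach the DLQ."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

MAX_ATTEMPTS = 6  # illustrative retry cap
delays = [backoff_delay(n) for n in range(MAX_ATTEMPTS)]
# Each delay is bounded by min(30, 0.5 * 2**n) seconds
```

The jitter spreads retries out so a dependency recovering from an outage is not hit by a synchronized retry storm.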
DLQ volume above baseline
Likely causes
- Schema drift / breaking changes.
- Auth/verification failures treated as retryable.
- Poison messages without quarantine.
Next checks
- Inspect top DLQ reasons and fix root cause.
- Reject permanent failures; don’t retry forever.
- Add replay workflow after fix.
Spike in signature failures
Likely causes
- Secret mismatch (test/prod).
- Body parsing change broke raw verification.
- Attack traffic hitting the endpoint.
Next checks
- Validate secret versions and rotation overlap.
- Capture raw bytes and verify offline.
- Rate limit and alert security on sustained spikes.
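Offline verification of captured raw bytes can be sketched with the standard library. The hex-digest-of-body scheme and the `whsec_` secrets below are assumptions for illustration; real providers typically sign a timestamp plus the body, so match your provider's documented scheme:

```python
import hashlib
import hmac

def verify_signature(raw_body: bytes, secret: str, received_sig: str) -> bool:
    """Recompute an HMAC-SHA256 hex digest over the captured raw bytes.

    Uses a constant-time comparison. Verify against the exact bytes you
    captured at ingest; re-serialized JSON will not match.
    """
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_sig)

body = b'{"event":"invoice.paid"}'
sig = hmac.new(b"whsec_test", body, hashlib.sha256).hexdigest()

ok_with_test_secret = verify_signature(body, "whsec_test", sig)
ok_with_prod_secret = verify_signature(body, "whsec_prod", sig)  # a test/prod secret mismatch shows up here
```

Replaying captured bytes against both secret versions quickly distinguishes a rotation misconfiguration from attack traffic.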
How Hooque helps
A webhook-to-queue system is easiest to monitor when the ingest and consumption interfaces are consistent and instrumentable.
- Durable ingest with provider verification so “accepted vs rejected” is clear.
- Queue semantics with explicit Ack/Nack/Reject outcomes to power alert rules.
- Metadata returned for each delivery (ready-to-call ack/nack/reject URLs) to simplify instrumentation.
- Dashboard inspection and replay to reduce MTTR after fixes.
- Per-webhook and per-consumer metrics to define SLOs and alerts.
For incident-triggered flows, see monitoring webhooks and review pricing.
FAQ
Practical monitoring questions for webhook-heavy systems.
What metrics matter for webhook monitoring?
Ingest success rate (2xx vs 4xx/5xx/timeouts), signature failures, retries/attempt counts, end-to-end processing latency (time-to-ack), backlog depth/age, and DLQ rate/volume. With Hooque, you get per-webhook and per-consumer visibility (received, queued, in-flight, delivered, rejected) to build alert rules and SLOs.
What SLOs should I set for webhooks?
Common SLOs include: ingest success rate, time-to-ack at p95/p99, maximum backlog age, and DLQ rate. Pick thresholds based on how quickly downstream systems must react. With Hooque, these map cleanly to queue health (backlog age, delivery outcomes) and consumer processing latency.
What alerts are most useful for webhook systems?
Alert on sustained non-2xx rates, increasing backlog age, spikes in signature failures, and DLQ volume above baseline. Avoid single-event alerts; focus on rates and sustained conditions. With Hooque, explicit Ack/Nack/Reject outcomes and queue backlog metrics give you stable alert signals.
How do I respond to a webhook incident?
Stop the bleed (pause/limit side effects), quarantine poison messages, fix the root cause, then replay safely. Use idempotency so replays do not duplicate side effects. With Hooque, you can inspect failed messages, Reject poison payloads with reasons, and replay after fixes from a durable history.
How do I know if idempotency is working?
Measure duplicate rate and dedupe hit rate. If you cannot quantify duplicates and deduped deliveries, you cannot prove correctness under retries. With Hooque, you can correlate deliveries and outcomes per consumer and implement dedupe using stable IDs from payload/meta.
What should I do when backlog grows?
Check if a dependency is degraded, reduce concurrency if retries are amplifying load, quarantine poison messages, and scale workers with bounded concurrency. Always track backlog age, not just depth. With Hooque, you can separate ingest durability from worker throughput and use queue backlog age as the primary signal.
Start processing webhooks reliably
Build reliable alerting and incident response on top of durable ingestion and explicit delivery outcomes.
No credit card required