Webhook Monitor Dashboard — Track Deliveries & Debug Failures
Webhook delivery is a fire-and-forget operation by design, but production systems need visibility into exactly what was sent, what was received, and what failed. This simulated webhook monitor dashboard replicates the core UI of a real delivery monitoring system: a live event timeline with status indicators, filter and search controls, per-event drill-down into request headers and payload body, response status and body, latency measurements, and aggregate statistics including success rate, average latency, and a breakdown of failure reasons.
Click any event row to inspect the full delivery detail — headers, request body, HTTP response code, and response body. Use the filter buttons to narrow the view to only failed or retried events. Use the search box to find events by event type or payload content. The stats panel at the top updates dynamically as you filter. Use the Re-Simulate button to generate a fresh batch of 40 simulated deliveries.
Understanding Webhook Delivery Monitoring
A webhook delivery attempt consists of an HTTP POST from the sending system (provider) to your endpoint (consumer). Monitoring this delivery requires capturing at minimum: the event type, the timestamp of the attempt, the HTTP status code returned by your endpoint, the response time, the request payload, and the response body. These six data points let you reconstruct exactly what happened for every event and diagnose failures without guesswork.
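As a minimal sketch, the six data points above could be captured in one record per attempt. The class and field names here are illustrative assumptions, not part of any provider's API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DeliveryAttempt:
    """One webhook delivery attempt (hypothetical record layout)."""
    event_type: str             # e.g. "invoice.paid"
    attempted_at: datetime      # when the provider sent the POST
    status_code: Optional[int]  # None when no HTTP response arrived
    latency_ms: float           # time until a response or the timeout
    request_body: bytes         # raw payload exactly as sent
    response_body: str          # what the endpoint returned

    @property
    def succeeded(self) -> bool:
        # Only a 2xx response counts as delivered.
        return self.status_code is not None and 200 <= self.status_code < 300


attempt = DeliveryAttempt(
    event_type="invoice.paid",
    attempted_at=datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
    status_code=500,
    latency_ms=182.0,
    request_body=b'{"id": "evt_1"}',
    response_body="Internal Server Error",
)
print(attempt.succeeded)  # False
```

Keeping the request body as raw bytes rather than a parsed object makes it possible to replay the delivery or re-check its signature later.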
Most webhook providers expose a delivery log in their dashboard, but granularity and retention vary significantly. Stripe, for example, shows every delivery attempt with headers, payload, response status, and response body for roughly the last 30 days. GitHub shows only a limited window of recent deliveries per webhook (on the order of the last 250). These retention limits change over time, so confirm them against the provider's current documentation. Custom webhook systems built in-house often have no delivery visibility at all, so debugging a failure means reconstructing the sequence from application logs, a slow and error-prone process that an embedded delivery monitor largely avoids.
Delivery Status States
Success means your endpoint returned a 2xx HTTP status code within the provider's timeout window (typically 5–30 seconds). The provider considers the event delivered and will not retry it. Any processing you do after returning 200 OK is outside the delivery contract. If that processing fails, the event is lost from the provider's perspective — which is why queuing webhook events internally before processing is the correct architecture for reliable systems.
Retry means the previous delivery attempt failed (a non-2xx response, a timeout, or a connection error) and the provider is scheduling another attempt. Retry behavior varies by provider. Stripe retries live-mode events with exponential backoff for up to 3 days; the intervals between attempts grow each time, for example 1 hour, 2 hours, 4 hours, 8 hours, 16 hours, then 24 hours. GitHub does not retry failed deliveries automatically; a failed delivery stays failed until you redeliver it manually from the webhook's delivery log or through the API. Understanding your provider's retry behavior is essential for planning incident response: if a Stripe webhook starts failing at 2 AM, you have up to 3 days to fix the endpoint before events drop out of the automatic retry queue, while a GitHub webhook failure has no automatic recovery at all.
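A capped, doubling retry schedule of the kind described above can be sketched as follows. The function name and parameters are hypothetical, and the 1-hour base, 24-hour cap, and 3-day window are illustrative values rather than any provider's documented configuration.

```python
from datetime import timedelta


def backoff_schedule(base: timedelta, cap: timedelta, window: timedelta) -> list:
    """Delays between attempts: double each time, never exceed `cap`,
    and stop once the next delay would run past `window`."""
    delays = []
    elapsed = timedelta(0)
    delay = base
    while elapsed + delay <= window:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * 2, cap)
    return delays


schedule = backoff_schedule(
    base=timedelta(hours=1),
    cap=timedelta(hours=24),
    window=timedelta(days=3),
)
hours = [d.total_seconds() / 3600 for d in schedule]
print(hours)  # [1.0, 2.0, 4.0, 8.0, 16.0, 24.0]
```

The six delays add up to 55 hours, so a seventh 24-hour wait would exceed the 72-hour window and the schedule stops there.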
Failed means all delivery attempts have been exhausted without a successful response. The event is permanently lost from the automatic delivery queue. Recovery requires either manually re-triggering the event in the provider's dashboard (if available) or reprocessing from your own event store. This is why idempotency in webhook handlers is not just a best practice but a recovery requirement: if you receive the same event twice (once from automatic retry, once from manual re-trigger), your system must produce the same result both times without duplicating side effects like charging a customer or sending an email.
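One common way to get this property is to record each event ID before performing side effects and to skip IDs already seen. The sketch below assumes every event carries a unique "id" field; the class name is hypothetical, and the in-memory set stands in for a durable store such as a database table with a unique constraint.

```python
class IdempotentHandler:
    """Processes each event ID at most once (illustrative sketch)."""

    def __init__(self):
        self._seen = set()     # assumption: a durable store in production
        self.emails_sent = []  # stands in for a real side effect

    def handle(self, event: dict) -> str:
        event_id = event["id"]
        if event_id in self._seen:
            # Automatic retry or manual re-trigger of a known event.
            return "duplicate"
        self.emails_sent.append(event["data"]["email"])
        self._seen.add(event_id)
        return "processed"


handler = IdempotentHandler()
event = {"id": "evt_1", "data": {"email": "customer@example.com"}}
first = handler.handle(event)
second = handler.handle(event)
print(first, second, len(handler.emails_sent))  # processed duplicate 1
```

In a real system the ID check and the side effect would need to happen atomically, for example inside one database transaction, so that two concurrent deliveries of the same event cannot both pass the check.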
Common Failure Reasons and How to Debug Them
HTTP 500 / Application Error is the most common failure type in production. Your endpoint received the webhook, began processing it, and threw an unhandled exception, so the framework returned a 500 instead of a success response. A top-level try/catch in the webhook handler helps only if the event is stored durably before you acknowledge it: catching every exception and returning 200 OK without saving the event turns a visible, retryable failure into a silently lost event. The more robust fix is to queue the event immediately on receipt, return 2xx, and process asynchronously, so exceptions in processing do not affect delivery acknowledgment and the event can be reprocessed from the queue.
Timeout occurs when your endpoint takes longer than the provider's timeout limit to return any HTTP response. Common causes include synchronous database queries, calling external APIs inside the handler, acquiring locks, or cold-start latency in serverless functions. The fix is to return 200 OK immediately (acknowledge receipt) and process the event asynchronously in a background worker or message queue. This pattern decouples delivery reliability from processing reliability.
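The acknowledge-then-process pattern can be sketched without any web framework. Here `receive` is a hypothetical function that the HTTP layer would call with the raw request body, and the in-memory `queue.Queue` is an assumption made for brevity; a production system would use a durable queue so that accepted events survive a restart.

```python
import json
import queue
import threading

events = queue.Queue()
processed = []


def receive(raw_body: bytes) -> int:
    """Called by the HTTP layer: enqueue, then acknowledge. No slow work here."""
    events.put(raw_body)
    return 200


def process(event: dict) -> None:
    # Stand-in for slow work: database writes, calls to external APIs, etc.
    processed.append(event["type"])


def worker() -> None:
    while True:
        raw = events.get()
        try:
            if raw is None:  # sentinel: shut the worker down
                return
            process(json.loads(raw))
        except Exception as exc:
            # A processing failure is logged; it never affects the 200 already sent.
            print(f"processing failed: {exc!r}")
        finally:
            events.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

status = receive(b'{"type": "invoice.paid"}')
receive(b"not json")  # acknowledged, then fails in the worker
events.join()         # wait for both events to be handled
events.put(None)
thread.join()
```

Because the handler only enqueues, its response time stays roughly constant no matter how slow `process` becomes.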
HTTP 401 / Signature Validation Failure indicates your endpoint is rejecting the webhook because signature verification failed. This can happen when the raw request body is parsed before signature verification (many frameworks do this automatically, so the bytes used for HMAC computation no longer match what the provider signed), when the signing secret is rotated on the provider side without updating the secret your endpoint verifies with, or when a proxy modifies headers or body in transit. Always verify signatures against the raw body bytes, not a parsed representation.
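The raw-bytes requirement is easy to demonstrate. This sketch assumes the signature is a plain hex HMAC-SHA256 of the request body; real providers differ in header names, prefixes, and whether a timestamp is included in the signed data, so treat the scheme here as illustrative. The secret and helper names are hypothetical.

```python
import hashlib
import hmac
import json

SECRET = b"whsec_example"  # hypothetical shared secret


def sign(secret: bytes, raw_body: bytes) -> str:
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify(secret: bytes, raw_body: bytes, received_signature: str) -> bool:
    expected = sign(secret, raw_body)
    # compare_digest is constant-time, which avoids leaking match length.
    return hmac.compare_digest(expected, received_signature)


raw = b'{"amount":100, "id":"evt_1"}'
signature = sign(SECRET, raw)  # what the provider would send

# Verifying the untouched bytes succeeds.
print(verify(SECRET, raw, signature))       # True

# Parsing and re-serializing changes whitespace, so the HMAC no longer matches.
reserialized = json.dumps(json.loads(raw)).encode()
print(verify(SECRET, reserialized, signature))  # False
```

The parsed and re-serialized body is semantically identical JSON, which is exactly why this failure is hard to spot without comparing bytes.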
Connection Refused / DNS Failure means the provider cannot reach your endpoint at all. Causes include your server being down, firewall rules blocking the provider's IP ranges, a misconfigured DNS record, or deploying a configuration change that breaks the webhook URL. Monitoring the connection refused rate separately from the application error rate helps distinguish infrastructure problems from application problems.
HTTP 429 / Rate Limited occurs when your endpoint, or a gateway in front of it, returns a rate limit response to the webhook provider. This is particularly common after outages: once your endpoint recovers, the backlog of retried webhooks can arrive in a burst on top of normal traffic, and providers generally do not slow redelivery to match your processing capacity. Implement a burst-tolerant ingestion layer (a queue in front of your processor) to absorb these spikes without rejecting deliveries.
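A monitor can bucket each failed attempt into the categories above with a small classifier. The reason labels and the `error` values ("timeout", "connection") are assumptions chosen for this sketch, not a standard vocabulary.

```python
from typing import Optional


def classify_failure(status_code: Optional[int], error: Optional[str] = None) -> Optional[str]:
    """Return a failure reason label, or None for a successful delivery."""
    if status_code is None:
        # No HTTP response at all: either the request timed out or never connected.
        return "timeout" if error == "timeout" else "connection_error"
    if 200 <= status_code < 300:
        return None
    if status_code == 401:
        return "signature_rejected"
    if status_code == 429:
        return "rate_limited"
    if status_code >= 500:
        return "application_error"
    return "other_client_error"


def is_infrastructure_problem(reason: Optional[str]) -> bool:
    # Connection-level failures point at servers, DNS, or firewalls,
    # not at handler code.
    return reason == "connection_error"


print(classify_failure(500))                 # application_error
print(classify_failure(None, "timeout"))     # timeout
print(classify_failure(None, "connection"))  # connection_error
```

Counting these labels separately is what makes it possible to track the connection error rate apart from the application error rate.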
What to Monitor: Key Metrics
The three most important webhook delivery metrics are success rate (percentage of delivery attempts returning 2xx), p99 response latency (time to return any response, which must stay below the provider's timeout), and failure reason distribution (what percentage of failures are each type). As rules of thumb, a success rate below 99% suggests a systemic problem rather than isolated failures, and p99 latency above 50% of the provider timeout is a warning sign, because a modest slowdown would then push the slowest requests past the timeout. A spike in a specific failure reason (e.g., suddenly 80% of failures are 500 errors) points directly at the root cause.
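These three metrics can be computed from a list of attempts in a few lines. The tuple layout, the function name, and the reason labels are assumptions made for this sketch, and p99 is calculated with the nearest-rank method, which is one of several common percentile definitions.

```python
from collections import Counter


def summarize(attempts, timeout_ms):
    """attempts: list of (status_code or None, latency_ms, failure_reason or None)."""
    if not attempts:
        raise ValueError("no attempts to summarize")
    total = len(attempts)
    successes = sum(
        1 for code, _, _ in attempts if code is not None and 200 <= code < 300
    )
    latencies = sorted(latency for _, latency, _ in attempts)
    rank = (99 * total + 99) // 100  # ceil(0.99 * total) in integer arithmetic
    p99 = latencies[rank - 1]
    reasons = Counter(reason for _, _, reason in attempts if reason)
    failures = sum(reasons.values())
    return {
        "success_rate": successes / total,
        "p99_latency_ms": p99,
        "latency_warning": p99 > 0.5 * timeout_ms,
        "failure_share": {r: n / failures for r, n in reasons.items()},
    }


attempts = [(200, 120.0, None)] * 8 + [
    (500, 300.0, "http_500"),
    (None, 30000.0, "timeout"),
]
stats = summarize(attempts, timeout_ms=30000)
print(stats["success_rate"])    # 0.8
print(stats["p99_latency_ms"])  # 30000.0
print(stats["failure_share"])   # {'http_500': 0.5, 'timeout': 0.5}
```

With only ten samples the p99 is simply the slowest attempt; the percentile becomes more informative as the sample grows.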
Secondary metrics include retry depth (what percentage of successful deliveries required more than one attempt), time to first retry (how long events sit in failed state before the provider retries), and event type distribution (which event types fail at higher rates, often revealing that certain event shapes trigger handler bugs). A well-instrumented monitoring system captures all of these dimensions and surfaces anomalies in real time.
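Retry depth and per-type failure rates fall out of the same kind of aggregation. The sketch below assumes one record per event with hypothetical keys "event_type", "attempts" (how many deliveries were tried), and "delivered" (whether any attempt succeeded).

```python
from collections import defaultdict


def retry_depth(deliveries) -> float:
    """Share of delivered events that needed more than one attempt."""
    delivered = [d for d in deliveries if d["delivered"]]
    if not delivered:
        return 0.0
    return sum(1 for d in delivered if d["attempts"] > 1) / len(delivered)


def failure_rate_by_type(deliveries) -> dict:
    """Fraction of events of each type that were never delivered."""
    totals = defaultdict(int)
    failed = defaultdict(int)
    for d in deliveries:
        totals[d["event_type"]] += 1
        if not d["delivered"]:
            failed[d["event_type"]] += 1
    return {event_type: failed[event_type] / totals[event_type] for event_type in totals}


deliveries = [
    {"event_type": "invoice.paid", "attempts": 1, "delivered": True},
    {"event_type": "invoice.paid", "attempts": 3, "delivered": True},
    {"event_type": "customer.updated", "attempts": 1, "delivered": True},
    {"event_type": "customer.updated", "attempts": 4, "delivered": False},
]
print(failure_rate_by_type(deliveries))
# {'invoice.paid': 0.0, 'customer.updated': 0.5}
```

In this sample one of the three delivered events needed a retry, so `retry_depth(deliveries)` is one third, and the failures are concentrated in a single event type, which is the kind of skew that often points at a handler bug.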