Webhook Monitor Dashboard — Track Deliveries & Debug Failures

May 28, 2026 · 12 min read · By Michael Lip

Webhook delivery is a fire-and-forget operation by design, but production systems need visibility into exactly what was sent, what was received, and what failed. This simulated webhook monitor dashboard replicates the core UI of a real delivery monitoring system: a live event timeline with status indicators, filter and search controls, per-event drill-down into request headers and payload body, response status and body, latency measurements, and aggregate statistics including success rate, average latency, and a breakdown of failure reasons.

Click any event row to inspect the full delivery detail — headers, request body, HTTP response code, and response body. Use the filter buttons to narrow the view to only failed or retried events. Use the search box to find events by event type or payload content. The stats panel at the top updates dynamically as you filter. Use the Re-Simulate button to generate a fresh batch of 40 simulated deliveries.

Webhook Delivery Monitor (Simulated)

[Interactive dashboard. Stats panel: Success Rate (of deliveries), Total Events (in window), Avg Latency (ms, successful deliveries), Failed (need attention), Retrying (pending retry), Top Failure (primary cause). Below the event timeline: Failure Reason Breakdown.]


Understanding Webhook Delivery Monitoring

A webhook delivery attempt consists of an HTTP POST from the sending system (provider) to your endpoint (consumer). Monitoring this delivery requires capturing at minimum: the event type, the timestamp of the attempt, the HTTP status code returned by your endpoint, the response time, the request payload, and the response body. These six data points let you reconstruct exactly what happened for every event and diagnose failures without guesswork.
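
As a rough illustration, the six data points above could be captured in a single record per delivery attempt. The class name, field names, and sample values below are hypothetical and only meant to show the shape of the data, not the schema of any particular provider.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DeliveryRecord:
    """One delivery attempt, holding the six data points listed above."""
    event_type: str             # e.g. "invoice.paid" (example value)
    attempted_at: datetime      # timestamp of the attempt
    status_code: Optional[int]  # None when no HTTP response arrived (timeout, refused)
    response_ms: float          # time until a response, or until the attempt was abandoned
    request_payload: bytes      # raw body as sent by the provider
    response_body: str          # body returned by the endpoint, possibly empty

    @property
    def succeeded(self) -> bool:
        # A delivery counts as successful only on a 2xx response.
        return self.status_code is not None and 200 <= self.status_code < 300


record = DeliveryRecord(
    event_type="invoice.paid",
    attempted_at=datetime(2026, 5, 28, 2, 0, tzinfo=timezone.utc),
    status_code=500,
    response_ms=812.0,
    request_payload=b'{"id": "evt_123"}',
    response_body="Internal Server Error",
)
print(record.succeeded)  # False
```

Keeping the payload as raw bytes rather than a parsed object means the stored record can still be used to re-check a signature later.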

Most webhook providers expose a delivery log in their dashboard, but the granularity varies significantly. Stripe, for example, shows every delivery attempt with headers, payload, response status, and response body for the last 30 days. GitHub shows the last 250 deliveries per webhook. Custom webhook systems built in-house often have no delivery visibility at all, so debugging a failure means reconstructing the sequence of events from application logs, a slow and error-prone process that an embedded delivery monitor largely eliminates.

Delivery Status States

Success means your endpoint returned a 2xx HTTP status code within the provider's timeout window (typically 5–30 seconds). The provider considers the event delivered and will not retry it. Any processing you do after returning 200 OK is outside the delivery contract. If that processing fails, the event is lost from the provider's perspective — which is why queuing webhook events internally before processing is the correct architecture for reliable systems.

Retry means the previous delivery attempt failed (a non-2xx response, a timeout, or a connection error) and the provider is scheduling another attempt. Retry schedules vary by provider: Stripe retries on an exponential backoff schedule (1 hour, 2 hours, 4 hours, 8 hours, 16 hours, 24 hours) for up to 3 days. GitHub does not retry failed deliveries automatically; a failed delivery stays failed until you redeliver it manually. Understanding your provider's retry behavior is essential for planning your incident response: if a Stripe webhook fails at 2 AM, you have up to 3 days before events are permanently lost, while a GitHub webhook failure has no automatic recovery.
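
The doubling intervals quoted above can be reproduced with a capped exponential backoff. This is only a sketch of the general technique: the function name and default parameters are assumptions chosen to match the figures in this article, not any provider's actual retry algorithm.

```python
def backoff_schedule(base_hours: float = 1.0,
                     cap_hours: float = 24.0,
                     attempts: int = 6) -> list:
    """Return the delay before each retry: doubling from base_hours, capped at cap_hours."""
    return [min(base_hours * 2 ** i, cap_hours) for i in range(attempts)]


schedule = backoff_schedule()
print(schedule)       # [1.0, 2.0, 4.0, 8.0, 16.0, 24.0]
print(sum(schedule))  # 55.0 hours, which fits inside a 3-day (72-hour) window
```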

Failed means all delivery attempts have been exhausted without a successful response. The event is permanently lost from the automatic delivery queue. Recovery requires either manually re-triggering the event in the provider's dashboard (if available) or reprocessing from your own event store. This is why idempotency in webhook handlers is not just a best practice but a recovery requirement: if you receive the same event twice (once from automatic retry, once from manual re-trigger), your system must produce the same result both times without duplicating side effects like charging a customer or sending an email.
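
A minimal sketch of an idempotent handler follows, assuming each event carries a unique `id` field. The class name, the in-memory set, and the `customer_email` field are hypothetical; a real system would keep processed IDs in durable storage, for example a database table with a unique constraint, so that deduplication survives restarts.

```python
class IdempotentHandler:
    """Applies each event's side effect at most once, keyed by event ID."""

    def __init__(self):
        self._seen = set()     # IDs of events already applied (in-memory stand-in)
        self.emails_sent = []  # stand-in for a real side effect such as sending an email

    def handle(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        event_id = event["id"]
        if event_id in self._seen:
            return False  # duplicate delivery: acknowledge, but do nothing
        self.emails_sent.append(event["data"]["customer_email"])
        self._seen.add(event_id)
        return True


handler = IdempotentHandler()
event = {"id": "evt_123", "data": {"customer_email": "a@example.com"}}
print(handler.handle(event))     # True: first delivery is applied
print(handler.handle(event))     # False: the redelivery is ignored
print(len(handler.emails_sent))  # 1
```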

Common Failure Reasons and How to Debug Them

HTTP 500 / Application Error is one of the most common failure types in production. Your endpoint received the webhook, began processing it, and threw an unhandled exception before returning a response. One fix is to wrap the handler in a top-level try/catch that logs the exception and still returns a 2xx response, so that a processing bug does not surface as a delivery failure. A better fix is to queue the event immediately on receipt and process it asynchronously, so that exceptions during processing cannot affect the delivery acknowledgment and the event is still available for reprocessing.

Timeout occurs when your endpoint takes longer than the provider's timeout limit to return any HTTP response. Common causes include synchronous database queries, calling external APIs inside the handler, acquiring locks, or cold-start latency in serverless functions. The fix is to return 200 OK immediately (acknowledge receipt) and process the event asynchronously in a background worker or message queue. This pattern decouples delivery reliability from processing reliability.
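
The acknowledge-then-process pattern can be sketched with the standard library alone. Here `receive` stands in for the HTTP handler and `process` for the slow business logic; both names are hypothetical, and the in-memory queue is only a placeholder for whatever durable queue a production system would use.

```python
import json
import queue
import threading

events = queue.Queue()  # in-memory stand-in for a durable message queue
processed = []          # events handled successfully
failures = []           # (raw body, error) pairs for later inspection
_STOP = object()        # sentinel that tells the worker to exit


def receive(raw_body: bytes) -> int:
    """Stand-in for the HTTP handler: enqueue and acknowledge immediately."""
    events.put(raw_body)
    return 200


def process(raw_body: bytes) -> None:
    """Stand-in for slow business logic; raises on malformed input."""
    processed.append(json.loads(raw_body))


def worker() -> None:
    while True:
        item = events.get()
        try:
            if item is _STOP:
                return
            process(item)
        except Exception as exc:
            # A processing error is recorded here; it never reaches the provider.
            failures.append((item, repr(exc)))
        finally:
            events.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()
statuses = [receive(b'{"id": "evt_1"}'), receive(b"not json")]
events.put(_STOP)
events.join()  # wait until the worker has drained the queue

print(statuses)        # [200, 200]
print(len(processed))  # 1
print(len(failures))   # 1
```

Both deliveries are acknowledged with 200 even though the second one fails during processing, which is the decoupling described above.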

HTTP 401 / Signature Validation Failure indicates your endpoint is rejecting the webhook because signature verification failed. This can happen when the raw request body is parsed before signature verification (many frameworks do this automatically, corrupting the bytes used for HMAC computation), when the secret is rotated without updating the verification logic, or when a proxy modifies headers or body in transit. Always verify signatures against the raw body bytes, not a parsed representation.
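
The effect of parsing before verifying can be shown with Python's standard `hmac` module. This sketch assumes the simplest possible scheme, a hex-encoded HMAC-SHA256 over the raw body; real providers wrap the value in their own header format (GitHub prefixes it with `sha256=`, and Stripe includes a timestamp in the signed content), so the secret, payload, and helper names here are illustrative only.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, raw_body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw body."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify(secret: bytes, raw_body: bytes, received_signature: str) -> bool:
    expected = sign(secret, raw_body)
    # compare_digest runs in constant time, which avoids leaking timing information.
    return hmac.compare_digest(expected, received_signature)


secret = b"whsec_example"                     # hypothetical shared secret
raw_body = b'{"id":"evt_1","amount":1000}'    # bytes exactly as received
signature = sign(secret, raw_body)            # what the provider would send

print(verify(secret, raw_body, signature))    # True

# Parsing and re-serializing changes the bytes (whitespace after separators),
# so the same signature no longer matches.
reserialized = json.dumps(json.loads(raw_body)).encode()
print(verify(secret, reserialized, signature))  # False
```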

Connection Refused / DNS Failure means the provider cannot reach your endpoint at all. Causes include your server being down, firewall rules blocking the provider's IP ranges, a misconfigured DNS record, or deploying a configuration change that breaks the webhook URL. Monitoring the connection refused rate separately from the application error rate helps distinguish infrastructure problems from application problems.

HTTP 429 / Rate Limited occurs when your endpoint returns a rate limit response to the webhook provider. This is particularly common after outages: when a backlog of retried webhooks arrives simultaneously after recovery, providers often resend them at the original rate, which can overwhelm your processing capacity. Implement a burst-tolerant ingestion layer (a queue in front of your processor) to absorb these traffic spikes without rate limiting the delivery.

What to Monitor: Key Metrics

The three most important webhook delivery metrics are success rate (percentage of delivery attempts returning 2xx), p99 response latency (time to return any response, which must stay below the provider's timeout), and failure reason distribution (what percentage of failures are each type). Success rate below 99% indicates a systemic problem. p99 latency above 50% of the provider timeout is a warning sign. A spike in a specific failure reason (e.g., suddenly 80% of failures are 500 errors) points directly at the root cause.
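
The three primary metrics can be computed from a list of delivery attempts in a few lines. The dictionary keys (`status`, `ms`, `reason`), the sample data, and the 10-second provider timeout are assumptions made for this sketch; p99 is calculated with the nearest-rank method.

```python
import math
from collections import Counter


def is_success(attempt: dict) -> bool:
    status = attempt["status"]
    return status is not None and 200 <= status < 300


def summarize(attempts: list, provider_timeout_ms: float) -> dict:
    if not attempts:
        raise ValueError("no delivery attempts to summarize")
    total = len(attempts)
    ok = sum(1 for a in attempts if is_success(a))
    failed = total - ok
    latencies = sorted(a["ms"] for a in attempts)
    p99 = latencies[math.ceil(0.99 * total) - 1]  # nearest-rank percentile
    reasons = Counter(a["reason"] for a in attempts if not is_success(a))
    return {
        "success_rate": ok / total,
        "p99_ms": p99,
        # Warning threshold from the text: p99 above 50% of the provider timeout.
        "latency_warning": p99 > 0.5 * provider_timeout_ms,
        "failure_share": {r: n / failed for r, n in reasons.items()},
    }


attempts = [{"status": 200, "ms": 120.0, "reason": None} for _ in range(8)]
attempts.append({"status": 500, "ms": 300.0, "reason": "http_500"})
attempts.append({"status": None, "ms": 10000.0, "reason": "timeout"})

summary = summarize(attempts, provider_timeout_ms=10000.0)
print(summary["success_rate"])     # 0.8
print(summary["p99_ms"])           # 10000.0
print(summary["latency_warning"])  # True
print(summary["failure_share"])    # {'http_500': 0.5, 'timeout': 0.5}
```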

Secondary metrics include retry depth (what percentage of successful deliveries required more than one attempt), time to first retry (how long events sit in failed state before the provider retries), and event type distribution (which event types fail at higher rates, often revealing that certain event shapes trigger handler bugs). A well-instrumented monitoring system captures all of these dimensions and surfaces anomalies in real time.

Frequently Asked Questions

Why did my webhook fail even though my server was running?

The most common cause is a timeout: your handler took longer than the provider's allowed response window (usually 5–30 seconds) to return any HTTP response. Fix by acknowledging receipt immediately with 200 OK and processing the event asynchronously. Other causes include signature validation failures, unhandled exceptions returning 500, or a firewall blocking the provider's IP.

How do I recover lost webhook events after an outage?

Recovery options depend on the provider. Stripe offers a "Resend" button in the webhook delivery log and also supports backfilling missed events via the Events API. GitHub lets you redeliver failed webhook deliveries from the settings page. For providers without manual redeliver, you must reconcile by comparing your database state against the provider's API. Build an idempotent reconciliation job that fetches recent events from the provider API and re-applies any that are missing from your system.
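
A reconciliation job of this kind might look like the sketch below. It assumes the recent events have already been fetched from the provider's API into a list of dictionaries with `id` and `created` fields; the fetching step, the field names, and the `apply` callback are all hypothetical.

```python
def reconcile(provider_events: list, applied_ids: set, apply) -> list:
    """Apply every provider event that is missing locally, oldest first.

    Returns the IDs that were replayed. Running it again is a no-op,
    because applied IDs are recorded as the job goes.
    """
    replayed = []
    for event in sorted(provider_events, key=lambda e: e["created"]):
        if event["id"] in applied_ids:
            continue  # already handled via normal webhook delivery
        apply(event)
        applied_ids.add(event["id"])
        replayed.append(event["id"])
    return replayed


provider_events = [
    {"id": "evt_3", "created": 300},
    {"id": "evt_1", "created": 100},
    {"id": "evt_2", "created": 200},
]
applied_ids = {"evt_1"}  # evt_1 arrived before the outage
applied_log = []

print(reconcile(provider_events, applied_ids, applied_log.append))  # ['evt_2', 'evt_3']
print(reconcile(provider_events, applied_ids, applied_log.append))  # []
```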

What HTTP status code should my webhook endpoint return?

Return HTTP 200 OK as quickly as possible, ideally within 1 second. Any 2xx code (200, 201, 202, 204) signals success to the provider. Return the 2xx response before doing any heavy processing; the response only needs to acknowledge receipt. If you return 4xx or 5xx, or if your endpoint times out, the provider will retry the delivery. Avoid returning 400 for business logic errors in the payload content, because a non-2xx response is treated as a failed delivery and triggers unnecessary retries.

How does webhook signature verification work?

The provider computes an HMAC-SHA256 hash of the raw request body using your shared webhook secret as the key, then includes this hash in a request header (e.g., Stripe-Signature, X-Hub-Signature-256). Some providers, Stripe among them, also include a timestamp in the signed content. Your endpoint recomputes the same hash and compares it. Always compute the hash from the raw bytes of the request body before any parsing. If your framework auto-parses JSON before your handler runs, configure it to expose the raw body as well. Do not reconstruct the raw body from the parsed object: re-serialization often introduces subtle differences in whitespace or key order that break signature verification.

How many retries do webhook providers attempt?

Retry policies vary significantly by provider. Stripe retries failed webhooks on exponential backoff for up to 3 days (attempts at roughly 1h, 2h, 4h, 8h, 16h, 24h intervals). GitHub makes one delivery attempt and does not retry automatically. Shopify retries up to 19 times over 48 hours. Twilio retries up to 3 times. Always check your provider's documentation for the exact retry policy and design your recovery procedures accordingly.
