What does 99.9% uptime (three nines) actually mean?

99.9% uptime (three nines) allows 8 hours, 45 minutes, and 36 seconds of downtime per year, or approximately 43.8 minutes per month. This means your API can be completely unavailable for nearly 44 minutes each month and still meet the SLA. For comparison, 99% (two nines) allows 7.3 hours per month, 99.95% allows 21.9 minutes, and 99.99% (four nines) allows only 4.38 minutes. Each additional nine represents a 10x reduction in allowable downtime and typically requires significantly more investment in redundancy and monitoring.

How do I calculate SLA compliance for my API?

SLA compliance is calculated as: (total_minutes_in_period - downtime_minutes) / total_minutes_in_period * 100. For monthly SLA: a 30-day month has 43,200 minutes. If your API was down for 60 minutes, uptime = (43200 - 60) / 43200 * 100 = 99.861%. To track this accurately, you need continuous monitoring with checks at least every minute, a clear definition of what constitutes 'down' (HTTP 5xx, timeout, latency above threshold), and an incident tracking system that records precise start and end times for each outage.
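As a minimal sketch of the formula above, assuming a 30-day month: the function names `uptime_percent` and `meets_sla` are illustrative and do not come from any particular monitoring product.

```python
PERIOD_MINUTES_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def uptime_percent(downtime_minutes: float,
                   period_minutes: float = PERIOD_MINUTES_30_DAYS) -> float:
    """Uptime as a percentage of the measurement period."""
    return (period_minutes - downtime_minutes) / period_minutes * 100


def meets_sla(downtime_minutes: float, target_percent: float,
              period_minutes: float = PERIOD_MINUTES_30_DAYS) -> bool:
    """True when measured uptime is at or above the SLA target."""
    return uptime_percent(downtime_minutes, period_minutes) >= target_percent


# The example from the text: 60 minutes of downtime in a 30-day month.
print(f"{uptime_percent(60):.3f}%")  # 99.861%
print(meets_sla(60, 99.9))           # False: 99.861% is below the 99.9% target
```

The result is only as good as the downtime figure fed into it, so the definition of "down" and the incident start and end times matter more than the arithmetic.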

What is the cost of API downtime?

API downtime cost depends on what the API enables. For e-commerce APIs, downtime directly equals lost revenue: if your store processes $10,000/hour and the payment API is down for 30 minutes, the direct cost is $5,000. For webhook-dependent systems, downtime causes missed events that may require manual reconciliation at $50-200/hour of engineering time. For SaaS APIs with SLA commitments, downtime triggers service credits (typically 10-25% of monthly fees per SLA breach). The total cost includes direct revenue loss, engineering recovery time, customer trust erosion, and contractual penalties.
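A rough cost model for the components listed above might look like the following. The function `downtime_cost` and its parameters are hypothetical, and the model covers only the quantifiable parts (lost revenue, recovery labor, service credits), not trust erosion.

```python
def downtime_cost(downtime_minutes: float, revenue_per_hour: float,
                  engineer_hours: float = 0.0, engineer_rate: float = 0.0,
                  monthly_fees: float = 0.0, credit_percent: float = 0.0) -> dict:
    """Estimate the direct, quantifiable cost of one outage."""
    lost_revenue = revenue_per_hour * downtime_minutes / 60
    recovery_labor = engineer_hours * engineer_rate
    service_credits = monthly_fees * credit_percent / 100
    return {
        "lost_revenue": lost_revenue,
        "recovery_labor": recovery_labor,
        "service_credits": service_credits,
        "total": lost_revenue + recovery_labor + service_credits,
    }


# The e-commerce example from the text: $10,000/hour, 30 minutes down.
print(downtime_cost(30, 10_000)["lost_revenue"])  # 5000.0
```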

How often should I check API endpoint health?

Check frequency depends on your SLA requirements and acceptable detection latency. For a 99.9% SLA (43.8 min/month budget), checking every minute is sufficient: you detect outages within about 1 minute, which leaves almost all of the downtime budget for actual resolution. For a 99.99% SLA (4.38 min/month), check every 10-15 seconds to minimize detection delay. For a 99.999% SLA (26 seconds/month), run health checks every 5 seconds from multiple locations. More frequent checks consume more monitoring resources but reduce Mean Time To Detect (MTTD).

What response time percentiles should I monitor?

Monitor p50 (median), p95, p99, and p99.9 percentiles. The median shows typical user experience. p95 reveals the experience of your slowest 5% of requests — often caused by cold starts, cache misses, or database contention. p99 catches rare but impactful slowdowns affecting 1 in 100 requests. p99.9 identifies extreme outliers that might indicate infrastructure issues. For webhook endpoints, p99 is the most critical metric because webhook providers typically use aggressive timeouts (5-30 seconds), and the slowest 1% of requests are the ones most likely to trigger timeout-based retries.

API Monitor Dashboard — Uptime Simulator with SLA Calculator

May 25, 2026 · 14 min read · By Michael Lip

API uptime is the foundation of every webhook integration, every automation pipeline, and every real-time data flow. When an API goes down, webhooks stop firing, polling returns errors, automation chains break, and downstream systems lose data. The API Monitor Dashboard simulates endpoint monitoring over a configurable time period, letting you explore how different uptime levels translate to actual minutes of downtime, SLA compliance outcomes, and financial impact. Configure your target SLA, inject simulated incidents, and see exactly how close your uptime budget is to exhaustion.

Set your monitoring parameters — check interval, SLA target, response time threshold, and hourly revenue impact — and the simulator generates a realistic 30-day monitoring timeline with randomized incidents based on your target uptime. The uptime bar shows every day color-coded by health status. Response time distribution reveals latency patterns. The SLA compliance table compares your simulated performance against standard industry tiers. Use this tool to understand the operational realities behind uptime numbers and plan your monitoring strategy accordingly.

API Uptime Monitor Simulator

Inputs: Scenario Presets, Target Uptime SLA (%), Check Interval (seconds), Response Time Threshold (ms), Hourly Revenue Impact ($), Monitoring Period, Incident Severity Mix

Sample output: Actual Uptime 99.95% (simulated); Total Downtime 21 min (across all incidents); SLA Budget Used 48% (of allowed downtime); Incidents during period (no value shown); Revenue Impact $175 (estimated loss); Avg Response Time 145 ms (p50 median)

Panels: Uptime Timeline (each block = 1 day, Day 1 to Day 30); Response Time Distribution (simulated checks, Fast to Slow); SLA Tier Compliance (columns: SLA Tier, Uptime, Allowed Downtime/Mo, Status); Incident Log

Understanding API Uptime and SLA Compliance

Uptime is expressed as a percentage of total time that a system is operational and accessible. The difference between 99% and 99.99% uptime sounds negligible — less than one percentage point — but in practice it represents a 100x difference in allowable downtime. At 99% uptime, your API can be down for 7.3 hours per month. At 99.99%, you have only 4.38 minutes. This exponential relationship between nines and downtime is the foundation of SLA engineering and directly determines how much you need to invest in redundancy, monitoring, and incident response.

Service Level Agreements (SLAs) formalize uptime commitments as contractual obligations with financial penalties for breaches. Most cloud providers offer 99.9% to 99.99% uptime SLAs with service credits (typically 10–25% of monthly fees) when targets are missed. The SLA itself does not prevent downtime — it merely defines the consequences. Actual reliability requires engineering investment in redundant infrastructure, automated failover, health monitoring, and rapid incident response. The monitoring dashboard above simulates these dynamics so you can understand the operational realities behind uptime numbers.

The Nines Table: Downtime by SLA Tier

The standard reference for uptime engineering is the nines table, which converts uptime percentages to concrete downtime budgets. Two nines (99%) allows 3.65 days of downtime per year, or about 7.3 hours per month. This is acceptable for internal tools and non-critical batch processing. Three nines (99.9%) allows 8 hours 45 minutes per year, or 43.8 minutes per month. This is the standard for most SaaS APIs and webhook endpoints. Four nines (99.99%) allows 52.6 minutes per year, or 4.38 minutes per month. This requires redundant infrastructure, automated failover, and sub-minute incident detection. Five nines (99.999%) allows 5.26 minutes per year and requires fully automated recovery with no human intervention in the critical path.
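The figures in this section can be reproduced with a few lines of Python. This is a sketch that assumes a 365-day year and a month equal to one twelfth of a year; `downtime_budget` is an illustrative name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored


def downtime_budget(uptime_percent: float) -> tuple[float, float]:
    """Return allowed downtime as (minutes per year, minutes per month)."""
    per_year = MINUTES_PER_YEAR * (1 - uptime_percent / 100)
    return per_year, per_year / 12


for tier in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_year, per_month = downtime_budget(tier)
    print(f"{tier:>7}%  {per_year:>9.2f} min/year  {per_month:>7.2f} min/month")
```

A contract that measures a 30-day month instead gives slightly smaller monthly budgets (43.2 minutes rather than 43.8 at three nines), so it is worth checking which period your SLA actually uses.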

Each additional nine requires roughly a 10x increase in infrastructure investment and operational maturity. Moving from three nines to four nines typically means deploying across multiple availability zones, implementing active-active load balancing, adding automated health checks with sub-minute detection, and building self-healing systems that can recover without human intervention. The cost curve is exponential: three nines might cost $500/month in infrastructure, four nines might cost $5,000, and five nines might cost $50,000 or more. Teams managing production API infrastructure must make explicit decisions about which tier to target based on business impact analysis.

Monitoring Strategy and Check Frequency

The check frequency of your monitoring system determines how quickly you detect outages, which directly affects your Mean Time To Detect (MTTD). If you check every 60 seconds, a complete outage is detected within at most 60 seconds (one missed check), and on average within about half that. For a three-nines SLA with a 43.8-minute monthly budget, a 1-minute detection delay consumes only 2.3% of your budget. For a four-nines SLA with a 4.38-minute budget, the same 1-minute delay consumes 22.8%: nearly a quarter of your entire monthly allowance is spent just detecting the problem.
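The budget arithmetic above can be checked with a short sketch. It assumes the worst case, where an outage begins just after a successful check, and a single outage per month; `detection_share_percent` is an illustrative name.

```python
def detection_share_percent(check_interval_s: float,
                            monthly_budget_min: float) -> float:
    """Worst-case share of the monthly downtime budget spent detecting one outage."""
    return (check_interval_s / 60) / monthly_budget_min * 100


for label, budget_min in (("99.9%", 43.8), ("99.99%", 4.38)):
    for interval_s in (60, 15):
        share = detection_share_percent(interval_s, budget_min)
        print(f"{label} SLA, check every {interval_s}s: {share:.1f}% of budget")
```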

Multi-location monitoring adds reliability to detection. A single monitoring probe might report a false positive due to network issues between the probe and your API. Monitoring from 3–5 geographically distributed locations and requiring at least 2 to report failure before triggering an alert eliminates most false positives while adding minimal detection delay (one additional check interval). This approach also detects regional outages that might not be visible from a single location, which is critical for globally distributed API infrastructure.
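The quorum rule described above can be sketched as follows. The region names are placeholders, and `should_alert` is a hypothetical helper rather than part of any monitoring tool.

```python
from typing import Mapping


def should_alert(probe_ok: Mapping[str, bool], quorum: int = 2) -> bool:
    """Alert only when at least `quorum` probe locations report a failed check."""
    failures = sum(1 for ok in probe_ok.values() if not ok)
    return failures >= quorum


# One failing probe is treated as a likely network blip between probe and API.
print(should_alert({"us-east": False, "eu-west": True, "ap-south": True}))   # False
# Two failing probes meet the quorum and trigger the alert.
print(should_alert({"us-east": False, "eu-west": False, "ap-south": True}))  # True
```

A fixed quorum of 2 can hide an outage that affects only one region, so some teams may prefer to log single-probe failures for later review rather than discard them.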

Response Time Distribution and Percentiles

Uptime measures availability (is the API responding at all?), but response time measures quality (how fast is it responding?). An API that returns 200 OK after 15 seconds is technically "up" but practically unusable for webhook processing, where providers typically time out after 5–30 seconds. Response time monitoring should track percentiles rather than averages: the average might be 150 ms, but the p99 might be 3,000 ms, meaning 1 in 100 requests takes 20x longer than typical.

The response time distribution typically follows a log-normal pattern: most requests cluster at the lower end (fast responses), with a long tail of increasingly slow responses. Cold starts, garbage collection pauses, database connection pool exhaustion, and cache misses all contribute to tail latency. The p99 and p99.9 percentiles capture these tail events, which are precisely the requests most likely to cause webhook delivery failures. Monitoring should trigger alerts when p99 latency exceeds your webhook provider's timeout threshold, not just when the average response time increases.
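To illustrate how percentiles expose the tail that an average hides, here is a small sketch using simulated latencies. The log-normal parameters are arbitrary values chosen for illustration (a median near 150 ms), and `percentile` uses the nearest-rank method, which is one of several common definitions.

```python
import math
import random


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    rank = min(max(rank, 1), len(ordered))  # clamp to a valid 1-based rank
    return ordered[rank - 1]


random.seed(42)  # fixed seed so repeated runs give the same numbers
# Simulated response times in ms; mu and sigma are assumptions, not measured data.
latencies = [random.lognormvariate(math.log(150), 0.8) for _ in range(10_000)]

for pct in (50, 95, 99, 99.9):
    print(f"p{pct}: {percentile(latencies, pct):.0f} ms")
print(f"mean: {sum(latencies) / len(latencies):.0f} ms")
```

In a real system the samples would come from monitoring checks or request logs, and an alert would compare the p99 value against the webhook provider's timeout.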

Incident Classification and Impact

Major incidents are complete outages where the API returns 5xx errors or does not respond at all. These are immediately visible to all consumers and typically require on-call engineering intervention. Major incidents consume your downtime budget rapidly and often trigger SLA breach notifications.

Minor incidents are partial degradations where the API responds but with elevated error rates or increased latency. These might affect only certain endpoints, certain geographic regions, or a percentage of requests. Minor incidents are harder to detect and can persist longer before triggering alerts, making them disproportionately damaging to cumulative uptime numbers.

Performance degradations are latency increases without errors. The API returns correct responses but slowly. These do not technically count as downtime in most SLA definitions but can still cause webhook delivery failures if response times exceed provider timeouts. Some SLAs include latency requirements (e.g., p95 response time under 500 ms) alongside availability requirements, creating a more complete quality measurement.
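The three categories above could be encoded as a simple classifier over one monitoring window. This is a sketch under stated assumptions: the `WindowStats` structure is hypothetical, "major" is taken to mean every check failed, and the 500 ms p95 threshold is borrowed from the example in the text.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregated check results for one monitoring window (hypothetical structure)."""
    total_checks: int
    failed_checks: int      # 5xx responses or no response at all
    p95_latency_ms: float


def classify(window: WindowStats, latency_threshold_ms: float = 500.0) -> str:
    """Map a window of check results to an incident class."""
    if window.total_checks == 0:
        return "no data"
    if window.failed_checks == window.total_checks:
        return "major"                    # complete outage
    if window.failed_checks > 0:
        return "minor"                    # partial degradation with errors
    if window.p95_latency_ms > latency_threshold_ms:
        return "performance degradation"  # slow, but no errors
    return "healthy"


print(classify(WindowStats(total_checks=60, failed_checks=60, p95_latency_ms=0)))    # major
print(classify(WindowStats(total_checks=60, failed_checks=6, p95_latency_ms=300)))   # minor
print(classify(WindowStats(total_checks=60, failed_checks=0, p95_latency_ms=1200)))  # performance degradation
```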

Downtime Cost Estimation

The financial impact of API downtime has both direct and indirect components. Direct costs include lost revenue (transactions that cannot complete), SLA penalty credits, and engineering time for incident response and recovery. Indirect costs include customer churn, reputational damage, and the operational debt from accumulated data inconsistencies that require manual reconciliation after recovery.

For webhook-dependent systems, the recovery cost often exceeds the downtime cost. When webhooks fail to deliver during an outage, the events accumulate in the provider's retry queue. When the API recovers, a burst of retried webhooks arrives simultaneously, potentially overwhelming the system and causing a secondary outage. Careful capacity planning for post-recovery traffic spikes is essential, and many teams implement exponential backoff on the consumer side to smooth out the retry storm.

Last updated: May 25, 2026