Understanding false positives in uptime monitoring

A false positive in uptime monitoring is an alert that fires when your service is actually up and running normally. Your phone buzzes at 2am, you scramble to check your dashboards, everything is green, and you go back to sleep frustrated. False positives are more than annoying — they train your team to ignore alerts, which means real incidents get missed.

What causes false positives

Most false positives share one root cause: the check runs from a single location. When your monitor is one server in one data center requesting your URL, any network issue between that server and your infrastructure looks like an outage, even if your service is fine and reachable by real users. Typical triggers include:

Network blips between the monitoring server and your CDN edge node
Temporary DNS resolution failures at the probe's ISP
TCP connection timeouts caused by congestion on a specific network path
Short-lived TLS handshake failures, for example during protocol version negotiation
CDN routing issues that affect one geographic path but not others

None of these mean your service is down. But a single-region monitor treats all of them as outages and fires an alert.

The single-region problem

Imagine your site is served through Amazon CloudFront and your monitor runs in Virginia. A brief network issue between the monitoring provider's Virginia server and the CloudFront edge in Virginia causes a 30-second window where checks fail. Your service is perfectly accessible from Chicago, London, and Tokyo, but your monitor fires a DOWN alert anyway.

This is not hypothetical. Single-region monitoring services report false positive rates of 5–15% in their own benchmarks. For a service checking every minute, that's potentially several false alerts per day.

How consensus verification works

The solution is to never trust a single check result. When a failure is detected, re-verify from a different location before alerting. If multiple independent probes agree the service is down, it almost certainly is. If they disagree, the failure was most likely a probe-side network issue rather than an outage.
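To see why agreement between independent probes is such strong evidence, here is a rough back-of-the-envelope sketch. The per-check transient failure probability of 0.1% is an illustrative assumption, not a measured figure, and the calculation assumes probe-side failures in different regions are independent, which real networks only approximate.

```python
# Illustrative only: the 0.001 probability below is an assumption, not a measurement.
CHECKS_PER_DAY = 24 * 60  # one check per minute


def expected_false_alerts_per_day(p_transient: float, probes_required: int) -> float:
    """Expected daily false alerts when `probes_required` independent probes
    must all fail at the same time for reasons unrelated to the service."""
    return CHECKS_PER_DAY * p_transient ** probes_required


single = expected_false_alerts_per_day(0.001, probes_required=1)
agreement = expected_false_alerts_per_day(0.001, probes_required=2)

print(f"single probe:        {single:.2f} false alerts/day")
print(f"two probes agreeing: {agreement:.5f} false alerts/day")
```

Under these assumptions, requiring a second independent probe to agree cuts the expected false alert rate by a factor of 1/p, here a thousandfold.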

This is how UptimeWiz works:

Checker Lambda detects a failure from its region
Instead of alerting immediately, it sends the failure report to a VerificationQueue
Consensus Lambda picks up the report and simultaneously re-checks from the other 2 regions
An incident is created and an alert sent only if 2 or more of the 3 regions, counting the original, confirm the failure
If both re-checks succeed, the original failure is logged as a transient blip and discarded
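The decision logic in these steps can be sketched as a small function. This is a minimal illustration, not UptimeWiz's actual code: the region names, the `probe` callable, and the `Verdict` type are all hypothetical, and the sketch applies the "2 or more regions total" rule literally, with the original failure counting as one of them. For brevity the re-checks run sequentially here, whereas the steps above describe them running simultaneously.

```python
from dataclasses import dataclass
from typing import Callable

REGIONS = ("us-east-1", "eu-west-1", "ap-northeast-1")  # hypothetical region names
MIN_FAILING_REGIONS = 2  # "2 or more regions total"


@dataclass
class Verdict:
    incident: bool
    failing_regions: list[str]


def verify_failure(
    url: str,
    origin_region: str,
    probe: Callable[[str, str], bool],
) -> Verdict:
    """Re-check `url` from every region except the one that reported the failure.

    `probe(region, url)` is a hypothetical callable that returns True when the
    check succeeds from that region.
    """
    failing = [origin_region]  # the original failure counts toward the total
    for region in REGIONS:
        if region != origin_region and not probe(region, url):
            failing.append(region)
    return Verdict(
        incident=len(failing) >= MIN_FAILING_REGIONS,
        failing_regions=failing,
    )


# Transient blip: only the original region saw a failure.
blip = verify_failure("https://example.com", "us-east-1", lambda region, url: True)
# Genuine outage: every region fails.
outage = verify_failure("https://example.com", "us-east-1", lambda region, url: False)
print(blip.incident, outage.incident)  # False True
```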

The latency trade-off

Consensus verification adds a small amount of detection latency — typically 10–15 seconds for the re-check cycle. This is a deliberate trade-off: slightly slower detection in exchange for near-zero false positives.
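The added latency stays small when the re-checks run concurrently, because the wait is then bounded by the slowest probe rather than the sum of all probes. The sketch below illustrates this with a thread pool; `slow_probe` is a hypothetical stand-in that sleeps instead of making a network request, and the region names are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def recheck_in_parallel(
    regions: Sequence[str],
    probe: Callable[[str], bool],
) -> tuple[dict[str, bool], float]:
    """Run `probe` once per region concurrently.

    Returns a region -> success mapping and the elapsed wall-clock seconds.
    Assumes `regions` is non-empty.
    """
    started = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        outcomes = list(pool.map(probe, regions))
    elapsed = time.monotonic() - started
    return dict(zip(regions, outcomes)), elapsed


def slow_probe(region: str) -> bool:
    # Stand-in for a real network check: sleep to simulate a round trip.
    time.sleep(0.2)
    return True


results, elapsed = recheck_in_parallel(["eu-west-1", "ap-northeast-1"], slow_probe)
# Elapsed time is roughly one probe's duration, not the sum of both.
print(results, f"{elapsed:.2f}s")
```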

For most production services, this is the right trade-off. An alert that fires 15 seconds later but is always real is far more valuable than instant alerts that are wrong 10% of the time. Alert fatigue is a silent killer of operational reliability.

What doesn't get filtered

Consensus verification filters network-level false positives but intentionally preserves genuine outages. If your service is actually down — your origin server is returning 500s, your database is unreachable, your deployment broke the application — all probes will fail, and the alert fires correctly within the re-check window.