Deep Dive · Feb 2025 · 6 min read

Understanding false positives in uptime monitoring

False positives — alerts that fire when everything is actually fine — are one of the biggest problems in uptime monitoring. Here's what causes them and how to eliminate them.

A false positive in uptime monitoring is an alert that fires when your service is actually up and running normally. Your phone buzzes at 2am, you scramble to check your dashboards, everything is green, and you go back to sleep frustrated. False positives are more than annoying — they train your team to ignore alerts, which means real incidents get missed.

What causes false positives

Most false positives originate from a single root cause: a check from a single location. When your monitor is one server in one data center pinging your URL, any network issue between that server and your infrastructure looks like an outage — even if your service is fine and accessible to real users.

  • Network blips between the monitoring server and your CDN edge node
  • Temporary DNS resolution failures at the probe's ISP
  • TCP connection timeouts caused by congestion on a specific network path
  • Short-lived SSL handshake failures due to TLS version negotiation
  • CDN routing issues that affect one geographic path but not others

None of these mean your service is down. But a single-region monitor treats all of them as outages and fires an alert.
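To make these categories concrete, here is a minimal sketch of how a single probe might label its own failures. It uses only Python's standard library; the function names `classify_probe_error` and `probe_tls` and the category strings are illustrative assumptions, not part of any particular monitoring product.

```python
import socket
import ssl


def classify_probe_error(exc: BaseException) -> str:
    """Map an exception raised during a probe to a coarse failure category."""
    # Order matters: gaierror, SSLError and timeouts are all subclasses of
    # OSError, so the specific types are tested before the generic ones.
    if isinstance(exc, socket.gaierror):
        return "dns_resolution"
    if isinstance(exc, ssl.SSLError):
        return "tls_handshake"
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "tcp_timeout"
    if isinstance(exc, ConnectionError):
        return "connection_error"
    if isinstance(exc, OSError):
        return "network_other"
    return "unknown"


def probe_tls(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Resolve, connect and complete a TLS handshake; return "ok" or a category."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return "ok"
    except OSError as exc:
        return classify_probe_error(exc)
```

Every category here describes what one probe saw on one network path. None of them can tell, on its own, whether real users are affected.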

The single-region problem

Imagine your site is served through Amazon CloudFront. Your monitor is in Virginia. A brief network issue between the monitoring provider's Virginia server and the CloudFront edge in Virginia causes a 30-second window where checks fail. Your service is perfectly accessible from Chicago, London, and Tokyo, but your monitor fires a DOWN alert.

This is not hypothetical. Single-region monitoring services report false positive rates of 5–15% in their own benchmarks. A monitor that checks every minute runs 1,440 checks a day, so even a small fraction of spurious failures can mean several false alerts per day.

How consensus verification works

The solution is to never trust a single check result. When a failure is detected, re-verify from a different location before alerting. If multiple independent probes agree the service is down, it almost certainly is. If they disagree, the failure was most likely a network issue on the probe's side.
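A rough back-of-the-envelope model shows why agreement between probes is such a strong signal. The sketch below assumes a healthy service, one check per minute from one region's checker, and a hypothetical per-probe spurious failure probability `p_spurious`. It also assumes probe failures are independent, which real networks only approximate. The function name and the example probability are illustrative, not measured values.

```python
CHECKS_PER_DAY = 24 * 60  # one check per minute


def false_alerts_per_day(p_spurious: float, other_regions: int = 0) -> float:
    """Expected false alerts per day while the service is actually healthy.

    p_spurious: hypothetical probability that one probe fails on its own.
    other_regions: number of extra regions that re-check a failure. An alert
    fires only if at least one of them also fails (2 regions in agreement).
    """
    if other_regions == 0:
        # Single-region monitoring: every spurious failure becomes an alert.
        p_alert = p_spurious
    else:
        # Probability that at least one re-checking region also fails spuriously.
        p_confirmed = 1 - (1 - p_spurious) ** other_regions
        p_alert = p_spurious * p_confirmed
    return CHECKS_PER_DAY * p_alert


single = false_alerts_per_day(0.005)        # 7.2 false alerts per day
consensus = false_alerts_per_day(0.005, 2)  # roughly 0.07 per day
```

Under these assumptions, a 0.5% spurious failure rate drops from about seven false alerts a day to roughly one every two weeks once two other regions must help confirm the failure.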

This is how UptimeWiz works:

  1. Checker Lambda detects a failure from its region
  2. Instead of alerting immediately, it sends the failure report to a VerificationQueue
  3. Consensus Lambda picks up the report and simultaneously re-checks from the other 2 regions
  4. An incident is created and an alert sent only if 2 or more regions in total confirm the failure
  5. If both re-checks succeed, the original failure is logged as a transient blip and discarded
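The steps above can be sketched in a few lines of Python. This is not UptimeWiz's actual code: the Lambda and queue plumbing is left out, and `confirm_failure`, `check_from`, and the region names in the usage note are hypothetical stand-ins. The sketch follows the "2 or more regions total" threshold from step 4.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

# Hypothetical probe signature: returns True if `url` looks healthy from `region`.
# It is expected to return False on failure rather than raise.
CheckFn = Callable[[str, str], bool]


def confirm_failure(
    url: str,
    reporting_region: str,
    all_regions: Iterable[str],
    check_from: CheckFn,
    required_failures: int = 2,
) -> bool:
    """Re-check a reported failure from the other regions in parallel.

    Returns True if an incident should be opened, i.e. the reporting region
    plus the re-checking regions add up to at least `required_failures`.
    """
    others = [r for r in all_regions if r != reporting_region]
    if not others:
        # Nothing to re-check with; only the original failure counts.
        return required_failures <= 1
    with ThreadPoolExecutor(max_workers=len(others)) as pool:
        healthy = list(pool.map(lambda region: check_from(region, url), others))
    # The original report counts as one failure; add each failed re-check.
    failures = 1 + sum(1 for ok in healthy if not ok)
    return failures >= required_failures
```

With three regions such as `["us-east-1", "eu-west-1", "ap-northeast-1"]` and a failure reported from the first, the function returns `False` when both re-checks succeed and `True` as soon as one of them also fails. Running the re-checks in parallel keeps the added delay close to the duration of a single check.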

The latency trade-off

Consensus verification adds a small amount of detection latency — typically 10–15 seconds for the re-check cycle. This is a deliberate trade-off: slightly slower detection in exchange for near-zero false positives.

For most production services, this is the right trade-off. An alert that fires 15 seconds later but is always real is far more valuable than instant alerts that are wrong 10% of the time. Alert fatigue is a silent killer of operational reliability.

What doesn't get filtered

Consensus verification filters network-level false positives but intentionally preserves genuine outages. If your service is actually down — your origin server is returning 500s, your database is unreachable, your deployment broke the application — all probes will fail, and the alert fires correctly within the re-check window.

The goal is not the fastest alert. The goal is the most trustworthy alert. Your team should respond to every notification with confidence, not skepticism.
