Health Checks¶
A check is a probe attached to an entity. The generator schedules checks on their configured intervals; workers execute them, record results, and trigger state transitions when failures accumulate past the threshold.
Check types¶
HTTP¶
Sends an HTTP or HTTPS request to a URL, using GET unless another method is configured.
| Parameter | Default | Description |
|---|---|---|
| target_url | — | Full URL including scheme |
| expected_status | 200 | HTTP status code that counts as success |
| body_contains | "" | Substring that must appear in the response body |
| interval_seconds | 60 | How often to run |
| timeout_ms | 5000 | Request timeout in milliseconds |
| failure_threshold | 3 | Consecutive failures before state change |
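As a rough illustration of how `expected_status` and `body_contains` combine, the sketch below evaluates a single response. The function name is hypothetical, and it assumes that both conditions must hold and that an empty `body_contains` (the default) skips the body check; the worker's actual logic is not shown on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide whether one HTTP response counts as a success.
# Arguments: actual status, response body, expected_status, body_contains.
http_check_passes() {
  local status="$1" body="$2" expected_status="${3:-200}" body_contains="${4:-}"
  # The status code must match expected_status exactly.
  [[ "$status" == "$expected_status" ]] || return 1
  # Assumption: an empty body_contains (the default) skips the body check.
  [[ -z "$body_contains" ]] && return 0
  # Quoting the right-hand side makes this a literal substring match, not a glob.
  [[ "$body" == *"$body_contains"* ]]
}
```

For example, `http_check_passes 200 '{"status":"ok"}' 200 '"status":"ok"'` succeeds, while the same call with a 503 status fails.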
TCP¶
Opens a TCP connection to a host:port pair.
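To reproduce a TCP check by hand, something like the following sketch can help. It is not the worker's implementation: the function name is hypothetical, and it assumes a bash build with `/dev/tcp` support plus the GNU coreutils `timeout` command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to host:port opens in time.
# Arguments: host, port, timeout in seconds (defaults to 5).
tcp_probe() {
  local host="$1" port="$2" timeout_s="${3:-5}"
  # Redirecting to /dev/tcp/HOST/PORT makes bash open a TCP connection.
  # The child shell receives host as $0 and port as $1.
  timeout "$timeout_s" bash -c ': >"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage:
#   tcp_probe db.internal 5432 3 && echo "reachable" || echo "unreachable"
```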
TLS¶
Like TCP, but also validates the TLS certificate (expiry and hostname).
DNS¶
Resolves a hostname and optionally validates the record type and expected value.
{
  "check_type": "dns",
  "config": {
    "hostname": "example.com",
    "record_type": "A",
    "expected_value": "93.184.216.34"
  }
}
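The comparison against `expected_value` can be approximated locally with `dig`. The helper below is a sketch with a hypothetical name; it assumes the check passes when any resolved record equals the expected value exactly, which this page does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read resolved record values from stdin, one per line,
# and succeed if any of them equals the expected value exactly.
record_matches() {
  local expected="$1" value
  while IFS= read -r value; do
    [[ "$value" == "$expected" ]] && return 0
  done
  return 1
}

# Usage (needs network access and dig):
#   dig +short A example.com | record_matches "93.184.216.34"
```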
Push¶
A push check inverts the relationship: instead of Wanepia reaching out to your service, your agent posts results to Wanepia. Use push checks for services that are unreachable from the public internet — databases behind a firewall, internal Kubernetes workloads, services on a private VPN, or anything that cannot accept inbound connections.
Push checks have no target_url; the worker never executes them. Your agent owns the probe logic and calls Wanepia when it has a result.
Push checks¶
Creating a push check¶
The response includes the check's id. Save it — you will use it in the push endpoint.
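A creation request might look like the sketch below. It reuses the `POST /v1/checks` endpoint and the field names from the HTTP example on this page; the `"check_type": "push"` value and the helper name are assumptions, so verify them against the API reference before relying on them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the JSON body for creating a push check.
# Assumption: push checks use "check_type": "push" on the create endpoint.
# Arguments: entity id, interval_seconds (default 300), failure_threshold (default 1).
push_check_body() {
  local entity_id="$1" interval_seconds="${2:-300}" failure_threshold="${3:-1}"
  printf '{"entity_id":"%s","check_type":"push","interval_seconds":%d,"failure_threshold":%d,"enabled":true}\n' \
    "$entity_id" "$interval_seconds" "$failure_threshold"
}

# Usage (requires TOKEN and a real entity id in ENTITY_ID):
#   curl -sS --fail -X POST https://api.wanepia.com/v1/checks \
#     -H "Authorization: Bearer $TOKEN" \
#     -H "Content-Type: application/json" \
#     -d "$(push_check_body "$ENTITY_ID" 300 1)"
```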
Posting a result¶
Call this endpoint from your agent each time it runs a probe:
POST /v1/checks/{check_id}/results
Request body:
| Field | Required | Description |
|---|---|---|
| success | Yes | true if the probe passed, false if it failed |
| latency_ms | Yes | How long the probe took, in milliseconds |
| error_message | No | Human-readable failure detail (shown in the UI) |
| checked_at | No | ISO 8601 timestamp — defaults to now if omitted |
| status_code | No | Numeric code (e.g. HTTP status) if meaningful |
curl -X POST https://api.wanepia.com/v1/checks/$CHECK_ID/results \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "success": true,
    "latency_ms": 45,
    "error_message": ""
  }'
Timestamp constraints
checked_at cannot be more than 1 minute in the future, and cannot be older than your account's retention window. Posting to a pull-mode check returns 409 Conflict.
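When the error text comes from a real probe, it may contain quotes or backslashes that would break a hand-built JSON string. The sketch below builds the request body with basic escaping and an explicit `checked_at` in UTC. Both function names are hypothetical, and the escaping covers only backslashes, double quotes, and newlines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: escape a string for use inside a JSON string literal.
# Handles backslashes, double quotes, and newlines only.
json_escape() {
  local s="$1"
  s=${s//\\/\\\\}      # backslashes first, so later escapes are not doubled
  s=${s//\"/\\\"}      # then double quotes
  s=${s//$'\n'/\\n}    # then newlines
  printf '%s' "$s"
}

# Hypothetical helper: build a result body.
# Arguments: success (true|false), latency_ms, error_message (optional).
build_result_payload() {
  local success="$1" latency_ms="$2" error_message="${3:-}" checked_at
  # ISO 8601 timestamp in UTC, taken at call time.
  checked_at="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf '{"success":%s,"latency_ms":%d,"error_message":"%s","checked_at":"%s"}\n' \
    "$success" "$latency_ms" "$(json_escape "$error_message")" "$checked_at"
}

# Usage:
#   curl -sS --fail -X POST "https://api.wanepia.com/v1/checks/$CHECK_ID/results" \
#     -H "Authorization: Bearer $TOKEN" \
#     -H "Content-Type: application/json" \
#     -d "$(build_result_payload false 120 'connection refused')"
```

Stamping `checked_at` at send time keeps it inside the allowed window as long as the agent's clock is reasonably accurate.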
Staleness detection¶
If your agent stops posting, Wanepia marks the check as stale when:
The dashboard shows a warning banner on stale push checks so you can tell the difference between "the probe ran and passed" and "the probe stopped running entirely".
Match the interval to your agent's cadence
Set interval_seconds to how often your agent actually runs. For example, if the agent runs every 5 minutes, set interval_seconds: 300 so the staleness threshold (10 minutes) matches the expected cadence.
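The staleness rule can be sketched as a simple comparison. The function below is hypothetical and assumes the threshold is twice `interval_seconds`, which is consistent with the 300-second interval and 10-minute threshold mentioned above but is not stated as a general rule on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a push check should be considered stale.
# Arguments: epoch seconds of the last result, current epoch seconds, interval_seconds.
is_stale() {
  local last_result_epoch="$1" now_epoch="$2" interval_seconds="$3"
  # Assumption: the threshold is twice the interval (300 s gives 600 s).
  local threshold=$(( 2 * interval_seconds ))
  (( now_epoch - last_result_epoch > threshold ))
}

# Usage:
#   is_stale "$last_result_epoch" "$(date +%s)" 300 && echo "stale"
```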
Failure threshold guidance¶
Push agents typically run infrequently. A failure_threshold of 3 means three consecutive failed runs before an alert fires, which is potentially 15 minutes of real downtime for an agent that runs every 5 minutes. For push checks, set failure_threshold: 1 so a single failure triggers an immediate alert.
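The arithmetic behind that figure is worth spelling out. The hypothetical function below computes an upper bound on downtime before the alert fires, assuming the outage begins just after a passing run and every later run fails.

```shell
#!/usr/bin/env bash
# Hypothetical helper: upper bound, in seconds, on downtime before an alert.
# Arguments: failure_threshold, interval between agent runs in seconds.
worst_case_alert_delay() {
  local failure_threshold="$1" interval_seconds="$2"
  echo $(( failure_threshold * interval_seconds ))
}

worst_case_alert_delay 3 300   # prints 900 (15 minutes)
worst_case_alert_delay 1 300   # prints 300 (5 minutes)
```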
Agent implementation patterns¶
Shell script — Postgres reachability¶
Drop this script on any host inside your network that can reach the database. Run it from cron or a Kubernetes CronJob.
#!/usr/bin/env bash
# Probe Postgres with pg_isready and push the result to Wanepia.
set -euo pipefail

# Fail fast with a clear message if a required variable is missing.
: "${WANEPIA_TOKEN:?WANEPIA_TOKEN must be set}"
: "${CHECK_ID:?CHECK_ID must be set}"
DB_HOST="${DB_HOST:-localhost}"
DB_PORT="${DB_PORT:-5432}"

# %3N (milliseconds) requires GNU date.
start=$(date +%s%3N)
if pg_isready -h "$DB_HOST" -p "$DB_PORT" -q; then
  success=true
  error_message=""
else
  success=false
  error_message="pg_isready: host unreachable or refusing connections"
fi
end=$(date +%s%3N)
latency=$(( end - start ))

payload=$(printf '{"success":%s,"latency_ms":%d,"error_message":"%s"}' \
  "$success" "$latency" "$error_message")

# -sS stays quiet but still prints errors; --fail turns HTTP errors into a non-zero exit.
curl -sS --fail -X POST "https://api.wanepia.com/v1/checks/$CHECK_ID/results" \
  -H "Authorization: Bearer $WANEPIA_TOKEN" \
  -H "Content-Type: application/json" \
  -d "$payload"
Run it as a cron job every minute:
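A crontab along these lines would do it. The script path is hypothetical and the two values are placeholders; the variable assignments at the top rely on the `NAME=value` syntax supported by common cron implementations.

```text
# Credentials for the probe script (placeholder values)
WANEPIA_TOKEN=replace-me
CHECK_ID=replace-me

# Run the Postgres probe every minute (hypothetical install path)
* * * * * /usr/local/bin/wanepia-pg-check.sh
```

Because the script exits non-zero when the POST fails, cron's usual mail-on-output behaviour gives a basic signal when the agent itself is broken.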
Kubernetes CronJob
Wrap the script in a lightweight container image and deploy a CronJob with schedule: "* * * * *" to probe services inside your cluster without exposing them to the internet.
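A minimal manifest for that setup might look like the following. The resource name, image, Secret name, and database host are all hypothetical; the Secret is assumed to hold WANEPIA_TOKEN and CHECK_ID as keys.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: wanepia-pg-check              # hypothetical name
spec:
  schedule: "* * * * *"
  concurrencyPolicy: Forbid           # skip a run if the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: probe
              image: registry.example.com/wanepia-pg-check:latest   # hypothetical image
              envFrom:
                - secretRef:
                    name: wanepia-agent   # hypothetical Secret with WANEPIA_TOKEN and CHECK_ID
              env:
                - name: DB_HOST
                  value: postgres.default.svc.cluster.local   # hypothetical service
```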
Creating a check¶
# HTTP check
wnp checks create \
  --entity <entity-id> \
  --type http \
  --url https://api.example.com/health \
  --interval 30 \
  --status 200 \
  --body '"status":"ok"' \
  --threshold 2
# TCP check (the CLI takes the host:port target via --url)
wnp checks create \
  --entity <entity-id> \
  --type tcp \
  --url db.internal:5432 \
  --interval 60
# HTTP check via the API
curl -X POST https://api.wanepia.com/v1/checks \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "entity_id": "...",
    "check_type": "http",
    "target_url": "https://api.example.com/health",
    "interval_seconds": 30,
    "timeout_ms": 3000,
    "expected_status": 200,
    "body_contains": "ok",
    "failure_threshold": 2,
    "enabled": true
  }'
Managing checks¶
# List all checks
wnp checks list
# Inspect a single check (prefix is enough)
wnp checks get a1b2
# Disable without deleting
wnp checks disable a1b2
# Re-enable
wnp checks enable a1b2
# Update interval
wnp checks update a1b2 --interval 120
# Delete
wnp checks delete a1b2
Viewing results¶
STATUS  LATENCY  CHECKED AT        ERROR
✓ 200   42ms     2024-01-20 09:15
✓ 200   38ms     2024-01-20 09:14
✗ 0     —        2024-01-20 09:13  connection refused
Failure threshold and state transitions¶
The failure_threshold prevents flapping. A check must fail that many times consecutively before the entity's status changes. A single success resets the failure counter.
attempt 1: fail (counter: 1/3)
attempt 2: fail (counter: 2/3)
attempt 3: fail (counter: 3/3) → entity transitions to down
attempt 4: ok → entity transitions to up, counter resets
See State Machine for degraded vs. down logic.
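The counter behaviour traced above can be reproduced with a small simulation. This sketch uses a hypothetical function name and models only the up and down states; the degraded state is left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replay a sequence of results against a failure threshold.
# Arguments: failure_threshold, then one "ok" or "fail" per attempt.
simulate_check() {
  local threshold="$1"; shift
  local failures=0 state="up" result
  for result in "$@"; do
    if [[ "$result" == "ok" ]]; then
      # A single success resets the counter and brings the entity back up.
      failures=0
      state="up"
    else
      failures=$(( failures + 1 ))
      # Only a full run of consecutive failures flips the state.
      (( failures >= threshold )) && state="down"
    fi
    printf '%s: %s (counter: %d/%d)\n' "$result" "$state" "$failures" "$threshold"
  done
}

simulate_check 3 fail fail fail ok
# fail: up (counter: 1/3)
# fail: up (counter: 2/3)
# fail: down (counter: 3/3)
# ok: up (counter: 0/3)
```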
Check limits¶
The number of checks per entity is governed by your plan limit: