SLA Monitoring
SLA is measured from real agent data using 5-minute snapshots. The aggregator runs every 5 minutes on a 15-minute delay — i.e. at 12:05 it processes the 11:45–11:50 window. The delay ensures every agent in every cluster has had time to report for that window (agents sync every 30 s). For dev/test environments you can trigger the daily report immediately with POST /api/v1/sla/report/generate instead of waiting for 07:00.
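The window math is easy to reproduce: take the current time, subtract the 15-minute delay, and floor to a 5-minute boundary. A minimal sketch (assumes GNU date):

# which 5-minute window the aggregator picks up right now
NOW=$(date +%s)
WINDOW_END=$(( (NOW - 15*60) / 300 * 300 ))
WINDOW_START=$(( WINDOW_END - 300 ))
echo "aggregating: $(date -d @$WINDOW_START +%H:%M)-$(date -d @$WINDOW_END +%H:%M)"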
How it works
| Component | Interval | Function |
|---|---|---|
| Agent Sync | 30s | Reports service status (OPERATIONAL/DEGRADED/DOWN) + writes sla_snapshots |
| Aggregator | 5 min | Buckets snapshots into 5-min windows, calculates uptime % (auto-starts) |
| Period Results | Per aggregation | Daily + monthly uptime % calculated and stored |
| Daily Report | 07:00 daily | Generates JSON + PDF report, pushes to SLA Portal |
| Cleanup | 1 hour | Deletes aggregated snapshots older than 90 days |
For SLA data to appear:
1. Agents must be reporting (sla_snapshots table not empty)
2. Services have workloadType + workloadName in ConfigMap (status != UNKNOWN)
3. SLA definitions + service assignments exist (auto-created from tier config)
4. Wait ~20 minutes for the first aggregation cycle (15-min delay + 5-min interval)
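If the dashboard stays empty after that, start with point 1 — whether snapshots are arriving at all. A quick sketch, assuming a Postgres backing store and psql access ($ITOPS_DB_URL is a placeholder; only the table name comes from this page):

# count snapshots directly in the backing store (assumed Postgres)
psql "$ITOPS_DB_URL" -c "SELECT count(*) FROM sla_snapshots;"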
SLA Tiers
| Tier | Uptime Target | Response Time | Resolution Time |
|---|---|---|---|
| Critical | 99.99% | 15 min | 4 hours |
| High | 99.9% | 60 min | 8 hours |
| Medium | 99.5% | 4 hours | 3 days |
| Low | 99.0% | 24 hours | 5 days |
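To make the targets concrete, the allowed downtime per 30-day month for each tier is simple arithmetic:

# error budget per 30-day month for each uptime target
for TARGET in 99.99 99.9 99.5 99.0; do
  awk -v t="$TARGET" 'BEGIN { printf "%s%% -> %.1f min/month\n", t, (100 - t) / 100 * 30 * 24 * 60 }'
done
# 99.99% -> 4.3, 99.9% -> 43.2, 99.5% -> 216.0, 99.0% -> 432.0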
Backup Monitoring
ITOps tracks backup status for services that have backup.expected: true in their ConfigMap. The Backup tab shows the last backup time, status, and alerts if a backup is overdue.
How it works (2 steps)
1. GitOps (ConfigMap) — Define operations.backup.expected: true and maxAgeDays in the service ConfigMap. The agent registers it and it appears in the Backup tab.
2. Push (Webhook) — Your backup CronJob/script sends a completion report via POST /api/v1/backup/report. If the service has no ConfigMap, the first push auto-creates it.
If no backup report arrives within maxAgeDays, the service shows as Stale (overdue) in the Backup tab.
Required fields
Only service (or slaGroup/namespace for group-level reports) is strictly required. nodeId is not rejected when missing — instead the backend places the service under an "unknown" hierarchy node with a red badge in the UI. This is deliberate: it makes misconfigurations visible instead of silently dropping data.
Always pass a real nodeId (organization/platform/environment/cluster) from cronjobs. If you see a service under the red "unknown" node in Operations Catalog, the push source is missing nodeId — fix the cronjob and re-apply.
Webhook Call
# Service-level report (nodeId recommended for correct placement)
curl -X POST https://api.yourdomain.com/api/v1/backup/report \
  -H "X-API-Key: $OPERATOR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"service":"my-database","nodeId":"myorg/platform/prod/cluster1","status":"success","sizeBytes":5242880}'
# SLA Group-level report (propagates to all services with backup.expected=true)
curl -X POST https://api.yourdomain.com/api/v1/backup/report \
  -H "X-API-Key: $OPERATOR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"slaGroup":"payment-system","status":"success"}'
# Namespace-level report
curl -X POST https://api.yourdomain.com/api/v1/backup/report \
  -H "X-API-Key: $OPERATOR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"namespace":"production","status":"success"}'
Storage Monitoring
The Operations Catalog has a Storage tab that shows disk usage for database and cache services.
How it works (2 options)
- GitOps (ConfigMap) + Push — Define the service with storage tags in a ConfigMap. The agent registers it (no usage data yet). Push metrics via POST /api/v1/storage/report with a nodeId that matches the hierarchy — the handler updates the existing service row.
- Push only (no ConfigMap) — Send storage metrics for a service that was never declared in a ConfigMap. The handler auto-creates the service row (source=external, tags=[storage], criticality=medium) on the first push. Ideal for bare-metal disks, external S3 buckets, RDS instances.
Required fields
Only service is strictly required. nodeId is not rejected when missing — instead the service appears under an "unknown" hierarchy node with a red badge, making the misconfig visible rather than silently dropping data. Always pass a real nodeId (organization/platform/environment/cluster) from cronjobs so the service lands under the correct cluster.
At least one of freePercent OR (allocatedBytes + usedBytes) must be present for the dashboard to show numbers; if both are omitted, the handler assumes 100% free.
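For sources that only know a percentage — an RDS instance, an external appliance — a freePercent-only push is enough. A sketch using the push-only path described above (the service name and storageType value are illustrative):

# push-only: no ConfigMap, just a free-space percentage
curl -X POST https://api.yourdomain.com/api/v1/storage/report \
  -H "X-API-Key: $OPERATOR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"service":"orders-rds","nodeId":"myorg/platform/prod/cluster1","freePercent":37,"storageType":"rds"}'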
Required Tags
A service appears in the Storage tab if it has any of these tags: storage, s3, ebs, database, rds, elasticache, cache. Only the storage endpoint auto-appends the storage tag to the service row — health and backup endpoints never modify tags, so a service that you want on the Storage tab must either declare the tag in its ConfigMap OR receive at least one storage push.
ConfigMap Example (Database with Storage)
data:
  it-ops.yaml: |
    version: "1"
    hierarchy:
      organization: "myorg"
      platform: "myplatform"
      environment: "prod"
      cluster: "cluster1"
      service: "postgresql"
    service:
      name: "postgresql"
      criticality: "critical"
      slaGroup: "payment-system"
      workloadType: "statefulset"
      workloadName: "postgresql"
      type: "database"
      tags:
        - database
        - storage
      metadata:
        serviceType: "PostgreSQL"
        allocatedStorage: "107374182400"  # 100 GiB in bytes
        usedStorage: "53687091200"        # 50 GiB in bytes
        freePercent: "50"
        version: "17.4"
        usedBy:  # Linked services (shown in Storage tab)
          - name: "payment-api"
            displayName: "Payment API"
          - name: "user-service"
            displayName: "User Service"
    operations:
      backup:
        expected: true
        maxAgeDays: 1
        schedule: "0 2 * * *"
        storageSize: "100Gi"
Linked Services
The Storage tab shows which services depend on a storage service. Define linked services in the ConfigMap metadata.usedBy array. Each entry needs name (service identifier) and displayName (shown in UI).
Storage Status Levels
| Free Space | Status | Color |
|---|---|---|
| > 30% | Healthy | Green |
| 10-30% | Warning | Yellow |
| < 10% | Critical | Red |
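If you want to mirror the UI status in your own alerting, the thresholds translate directly to shell (cutoffs taken from the table above; integer percent assumed):

# classify a free-space percentage the way the Storage tab does
classify_free() {
  if [ "$1" -gt 30 ]; then echo "Healthy"
  elif [ "$1" -ge 10 ]; then echo "Warning"
  else echo "Critical"
  fi
}
classify_free 50   # -> Healthy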
Push Storage Metrics (Webhook)
For dynamic storage monitoring, push metrics via the REST API. The endpoint updates service metadata in real time.
# Report storage usage for a service (always include nodeId!)
curl -X POST https://api.yourdomain.com/api/v1/storage/report \
-H "X-API-Key: $OPERATOR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"service": "postgresql",
"nodeId": "myorg/platform/prod/cluster1",
"allocatedBytes": 107374182400,
"usedBytes": 53687091200,
"storageType": "pvc",
"mountPath": "/var/lib/postgresql"
}'
The API key is the same ITOPS_SECURITY_OPERATOR_API_KEY used by the agent.
CronJob Example (K8s PVC monitoring)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: itops-storage-reporter
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          serviceAccountName: storage-reporter  # hypothetical SA; needs RBAC to list pods and use pods/exec
          containers:
          - name: reporter
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              API_URL="http://itops-core.itops:8080"
              API_KEY="your-operator-api-key"  # prefer injecting this from a Secret
              NODE_ID="myorg/platform/prod/cluster1"  # matches agent hierarchy
              for POD in $(kubectl get pods -n production -o name); do
                NAME=$(echo $POD | sed 's|pod/||')
                USAGE=$(kubectl exec -n production $NAME -- df /data 2>/dev/null | tail -1)
                if [ -n "$USAGE" ]; then
                  TOTAL=$(echo $USAGE | awk '{print $2*1024}')  # df reports 1K blocks
                  USED=$(echo $USAGE | awk '{print $3*1024}')
                  curl -s -X POST "$API_URL/api/v1/storage/report" \
                    -H "X-API-Key: $API_KEY" \
                    -H "Content-Type: application/json" \
                    -d "{\"service\":\"$NAME\",\"nodeId\":\"$NODE_ID\",\"allocatedBytes\":$TOTAL,\"usedBytes\":$USED,\"storageType\":\"pvc\"}"
                fi
              done
External Service Health (Bare Metal / VM)
Services that are not running in Kubernetes (bare-metal databases, VMs, network appliances) can push their health status via webhook. The service is auto-created on first push.
Health Push Endpoint
# Push health from a bare-metal Galera node (every 30 s, e.g. via a systemd timer)
curl -X POST https://api.yourdomain.com/api/v1/health/report \
-H "X-API-Key: $OPERATOR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"service": "galera-node1",
"status": "OPERATIONAL",
"message": "wsrep_cluster_size=3, wsrep_ready=ON",
"nodeId": "myorg/infra/prod/baremetal",
"criticality": "critical",
"slaGroup": "database-cluster",
"serviceType": "database",
"tags": ["database", "storage", "galera"]
}'
Status values: OPERATIONAL, DEGRADED, DOWN, MAINTENANCE, UNKNOWN
nodeId format is org/platform/env/cluster (same 4-level path as the agent). Subsequent pushes update status only.
Script Example (systemd timer or cron)
On bare metal, load the API key from a root-only file, not from the script body — scripts in /usr/local/bin often end up in backups and shipped logs. chmod 600 /etc/itops/api-key.
#!/bin/bash
# /usr/local/bin/itops-health-push.sh
# Run every 30 s via a systemd timer (plain cron only fires once per minute)
API_URL="https://api.yourdomain.com"
API_KEY=$(cat /etc/itops/api-key) # root-only file, chmod 600
NODE_ID="myorg/infra/prod/dc1"
# Check MySQL/Galera cluster status
if mysqladmin ping --silent 2>/dev/null; then
  CLUSTER_SIZE=$(mysql -Nse "SHOW STATUS LIKE 'wsrep_cluster_size'" | awk '{print $2}')
  STATUS="OPERATIONAL"
  MSG="cluster_size=$CLUSTER_SIZE"
  # guard against an empty value if the node isn't running Galera
  [ "${CLUSTER_SIZE:-0}" -lt 3 ] && STATUS="DEGRADED" && MSG="cluster degraded: size=$CLUSTER_SIZE"
else
  STATUS="DOWN"
  MSG="MySQL not responding"
fi
curl -s -X POST "$API_URL/api/v1/health/report" \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d "{\"service\":\"galera-node1\",\"status\":\"$STATUS\",\"message\":\"$MSG\",\"nodeId\":\"$NODE_ID\"}"
Auto Incidents
When a service goes DOWN, ITOps automatically:
- Creates an SLA incident (source: MONITORING)
- Generates an INCIDENT ticket (if ticketing plugin active)
- Updates the SLA dashboard in real time
- Closes the incident when service recovers
Stale Detection (external / push services)
A background goroutine in itops-core runs every 60 s and flips EXTERNAL services (those created via push webhooks, not the K8s agent) to UNKNOWN after 2 minutes of silence, and to DOWN after 5 minutes. This guards against dead cronjobs silently hiding real outages — if the push source breaks, the UI makes it visible within a minute or two.
Agent-reported services (K8s workloads) don't use this timeout: they're driven by the agent's regular 30 s sync, so pausing the agent shows up directly as a missing heartbeat in the Operator Nodes view.
Multi-Cluster SLA Groups
SLA groups (slaGroup: "payment-system") are a cross-cluster logical concept. The sla_groups table is UNIQUE(name) — if two agents in different clusters, or an agent and a bare-metal push webhook, report the same slaGroup name, all of them join the same group row. Memberships from every source are merged, and the group card on the SLA Overview tab lists them together regardless of origin.
Both the agent sync loop and every push endpoint call the same EnsureSLAGroup / UpsertSLAGroupMember middleware — there is no separate "external" group path, so there's no way to accidentally create duplicate groups with the same name.
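For example, a bare-metal node and a service in another cluster can report the same group name; both land in the same group row. A sketch reusing the health endpoint above (payment-worker and cluster2 are hypothetical):

# two sources, same slaGroup -> one merged group on the SLA Overview tab
curl -s -X POST https://api.yourdomain.com/api/v1/health/report \
  -H "X-API-Key: $OPERATOR_API_KEY" -H "Content-Type: application/json" \
  -d '{"service":"galera-node1","status":"OPERATIONAL","nodeId":"myorg/infra/prod/baremetal","slaGroup":"payment-system"}'
curl -s -X POST https://api.yourdomain.com/api/v1/health/report \
  -H "X-API-Key: $OPERATOR_API_KEY" -H "Content-Type: application/json" \
  -d '{"service":"payment-worker","status":"OPERATIONAL","nodeId":"myorg/platform/prod/cluster2","slaGroup":"payment-system"}'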
Manual Report Trigger
Normally the daily SLA report is generated at 07:00 local time and, if ITOPS_SLA_PORTAL_URL is set, pushed to the SLA Portal. For development, QA, or a post-incident re-generation you can trigger it on demand:
curl -X POST https://api.yourdomain.com/api/v1/sla/report/generate \
-H "X-API-Key: $OPERATOR_API_KEY"
The endpoint returns the generated report's date and path. It's idempotent for the current day (overwrites the day's existing report).