SLA Monitoring
SLA is measured from real agent data using 5-minute snapshots. The aggregator runs every 5 minutes on a 15-minute delay — i.e. at 12:05 it processes the 11:45–11:50 window. The delay ensures every agent in every cluster has had time to report for that window (agents sync every 30 s). For dev/test environments you can trigger the daily report immediately with POST /api/v1/sla/report/generate instead of waiting for 07:00.
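The window math is easy to reproduce: take the current time, subtract the 15-minute delay, and floor to a 5-minute boundary. A minimal sketch (assumes GNU date):

# which 5-minute window the aggregator picks up right now
NOW=$(date +%s)
WINDOW_END=$(( (NOW - 15*60) / 300 * 300 ))
WINDOW_START=$(( WINDOW_END - 300 ))
echo "aggregating: $(date -d @$WINDOW_START +%H:%M)-$(date -d @$WINDOW_END +%H:%M)"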
How it works
| Component | Interval | Function |
|---|---|---|
| Agent Sync | 30s | Reports service status (OPERATIONAL/DEGRADED/DOWN) + writes sla_snapshots |
| Aggregator | 5 min | Buckets snapshots into 5-min windows, calculates uptime % (auto-starts) |
| Period Results | Per aggregation | Daily + monthly uptime % calculated and stored |
| Daily Report | 07:00 daily | Generates JSON + PDF report, pushes to SLA Portal |
| Cleanup | 1 hour | Deletes aggregated snapshots older than 90 days |
For SLA data to appear:
1. Agents must be reporting (sla_snapshots table not empty)
2. Services have workloadType + workloadName in ConfigMap (status != UNKNOWN)
3. SLA definitions + service assignments exist (auto-created from tier config)
4. Wait ~20 minutes for the first aggregation cycle (15-min delay + 5-min interval)
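If the dashboard stays empty after that, start with point 1 — whether snapshots are arriving at all. A quick sketch, assuming a Postgres backing store and psql access ($ITOPS_DB_URL is a placeholder; only the table name comes from this page):

# count snapshots directly in the backing store (assumed Postgres)
psql "$ITOPS_DB_URL" -c "SELECT count(*) FROM sla_snapshots;"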
SLA Tiers
| Tier | Uptime Target | Response Time | Resolution Time |
|---|---|---|---|
| Critical | 99.99% | 15 min | 4 hours |
| High | 99.9% | 60 min | 8 hours |
| Medium | 99.5% | 4 hours | 3 days |
| Low | 99.0% | 24 hours | 5 days |
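To make the targets concrete, the allowed downtime per 30-day month for each tier is simple arithmetic:

# error budget per 30-day month for each uptime target
for TARGET in 99.99 99.9 99.5 99.0; do
  awk -v t="$TARGET" 'BEGIN { printf "%s%% -> %.1f min/month\n", t, (100 - t) / 100 * 30 * 24 * 60 }'
done
# 99.99% -> 4.3, 99.9% -> 43.2, 99.5% -> 216.0, 99.0% -> 432.0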
Backup Monitoring
ITOps tracks backup status for services that have backup.expected: true in their ConfigMap. The Backup tab shows the last backup time, status, and alerts if a backup is overdue.
How it works (2 steps)
1. GitOps (ConfigMap) — Define operations.backup.expected: true and maxAgeDays in the service ConfigMap. The agent registers it and it appears in the Backup tab.
2. Push (Webhook) — Your backup CronJob/script sends a completion report via POST /api/v1/backup/report. If the service has no ConfigMap, the first push auto-creates it.
If no backup report arrives within maxAgeDays, the service shows as Stale (overdue) in the Backup tab.
Required fields
Only service (or slaGroup/namespace for group-level reports) is strictly required. nodeId is not rejected when missing — instead the backend places the service under an "unknown" hierarchy node with a red badge in the UI. This is deliberate: it makes misconfigurations visible instead of silently dropping data.
Always pass a real nodeId (organization/platform/environment/cluster) from cronjobs. If you see a service under the red "unknown" node in Operations Catalog, the push source is missing nodeId — fix the cronjob and re-apply.
Webhook Call
# Service-level report (nodeId recommended for correct placement)
curl -X POST https://api.yourdomain.com/api/v1/backup/report \
  -H "X-API-Key: $OPERATOR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"service":"my-database","nodeId":"myorg/platform/prod/cluster1","status":"success","sizeBytes":5242880}'
# SLA Group-level report (propagates to all services with backup.expected=true)
curl -X POST https://api.yourdomain.com/api/v1/backup/report \
  -H "X-API-Key: $OPERATOR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"slaGroup":"payment-system","status":"success"}'
# Namespace-level report
curl -X POST https://api.yourdomain.com/api/v1/backup/report \
  -H "X-API-Key: $OPERATOR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"namespace":"production","status":"success"}'
Storage Monitoring
The Operations Catalog has a Storage tab that shows disk usage for database and cache services.
How it works (2 options)
- GitOps (ConfigMap) + Push — Define the service with storage tags in a ConfigMap. The agent registers it (no usage data yet). Push metrics via POST /api/v1/storage/report with a nodeId that matches the hierarchy — the handler updates the existing service row.
- Push only (no ConfigMap) — Send storage metrics for a service that was never declared in a ConfigMap. The handler auto-creates the service row (source=external, tags=[storage], criticality=medium) on the first push. Ideal for bare-metal disks, external S3 buckets, RDS instances.
Required fields
Only service is strictly required. nodeId is not rejected when missing — instead the service appears under an "unknown" hierarchy node with a red badge, making the misconfig visible rather than silently dropping data. Always pass a real nodeId (organization/platform/environment/cluster) from cronjobs so the service lands under the correct cluster.
At least one of freePercent OR (allocatedBytes + usedBytes) must be present for the dashboard to show numbers; if both are omitted, the handler assumes 100% free.
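For sources that only know a percentage — an RDS instance, an external appliance — a freePercent-only push is enough. A sketch using the push-only path described above (the service name and storageType value are illustrative):

# push-only: no ConfigMap, just a free-space percentage
curl -X POST https://api.yourdomain.com/api/v1/storage/report \
  -H "X-API-Key: $OPERATOR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"service":"orders-rds","nodeId":"myorg/platform/prod/cluster1","freePercent":37,"storageType":"rds"}'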
Required Tags
A service appears in the Storage tab if it has any of these tags: storage, s3, ebs, database, rds, elasticache, cache. Only the storage endpoint auto-appends the storage tag to the service row — health and backup endpoints never modify tags, so a service that you want on the Storage tab must either declare the tag in its ConfigMap OR receive at least one storage push.
ConfigMap Example (Database with Storage)
data:
  it-ops.yaml: |
    version: "1"
    hierarchy:
      organization: "myorg"
      platform: "myplatform"
      environment: "prod"
      cluster: "cluster1"
      service: "postgresql"
    service:
      name: "postgresql"
      criticality: "critical"
      slaGroup: "payment-system"
      workloadType: "statefulset"
      workloadName: "postgresql"
      type: "database"
      tags:
        - database
        - storage
      metadata:
        serviceType: "PostgreSQL"
        allocatedStorage: "107374182400"  # 100 GiB in bytes
        usedStorage: "53687091200"        # 50 GiB in bytes
        freePercent: "50"
        version: "17.4"
        usedBy:  # Linked services (shown in Storage tab)
          - name: "payment-api"
            displayName: "Payment API"
          - name: "user-service"
            displayName: "User Service"
    operations:
      backup:
        expected: true
        maxAgeDays: 1
        schedule: "0 2 * * *"
        storageSize: "100Gi"
Linked Services
The Storage tab shows which services depend on a storage service. Define linked services in the ConfigMap metadata.usedBy array. Each entry needs name (service identifier) and displayName (shown in UI).
Storage Status Levels
| Free Space | Status | Color |
|---|---|---|
| > 30% | Healthy | Green |
| 10-30% | Warning | Yellow |
| < 10% | Critical | Red |
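If you want to mirror the UI status in your own alerting, the thresholds translate directly to shell (cutoffs taken from the table above; integer percent assumed):

# classify a free-space percentage the way the Storage tab does
classify_free() {
  if [ "$1" -gt 30 ]; then echo "Healthy"
  elif [ "$1" -ge 10 ]; then echo "Warning"
  else echo "Critical"
  fi
}
classify_free 50   # -> Healthy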
Push Storage Metrics (Webhook)
For dynamic storage monitoring, push metrics via the REST API. The endpoint updates service metadata in real time.
# Report storage usage for a service (always include nodeId!)
curl -X POST https://api.yourdomain.com/api/v1/storage/report \
-H "X-API-Key: $OPERATOR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"service": "postgresql",
"nodeId": "myorg/platform/prod/cluster1",
"allocatedBytes": 107374182400,
"usedBytes": 53687091200,
"storageType": "pvc",
"mountPath": "/var/lib/postgresql"
}'
The API key is the same ITOPS_SECURITY_OPERATOR_API_KEY used by the agent.
CronJob Example (K8s PVC monitoring)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: itops-storage-reporter
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          serviceAccountName: storage-reporter  # hypothetical SA; needs RBAC to list pods and use pods/exec
          containers:
          - name: reporter
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              API_URL="http://itops-core.itops:8080"
              API_KEY="your-operator-api-key"  # prefer injecting this from a Secret
              NODE_ID="myorg/platform/prod/cluster1"  # matches agent hierarchy
              for POD in $(kubectl get pods -n production -o name); do
                NAME=$(echo $POD | sed 's|pod/||')
                USAGE=$(kubectl exec -n production $NAME -- df /data 2>/dev/null | tail -1)
                if [ -n "$USAGE" ]; then
                  TOTAL=$(echo $USAGE | awk '{print $2*1024}')  # df reports 1K blocks
                  USED=$(echo $USAGE | awk '{print $3*1024}')
                  curl -s -X POST "$API_URL/api/v1/storage/report" \
                    -H "X-API-Key: $API_KEY" \
                    -H "Content-Type: application/json" \
                    -d "{\"service\":\"$NAME\",\"nodeId\":\"$NODE_ID\",\"allocatedBytes\":$TOTAL,\"usedBytes\":$USED,\"storageType\":\"pvc\"}"
                fi
              done
External Service Health (Bare Metal / VM)
Services that are not running in Kubernetes (bare-metal databases, VMs, network appliances) can push their health status via webhook. The service is auto-created on first push.
Health Push Endpoint
# Push health from a bare-metal Galera node (every 30 s, e.g. via a systemd timer)
curl -X POST https://api.yourdomain.com/api/v1/health/report \
-H "X-API-Key: $OPERATOR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"service": "galera-node1",
"status": "OPERATIONAL",
"message": "wsrep_cluster_size=3, wsrep_ready=ON",
"nodeId": "myorg/infra/prod/baremetal",
"criticality": "critical",
"slaGroup": "database-cluster",
"serviceType": "database",
"tags": ["database", "storage", "galera"]
}'
Status values: OPERATIONAL, DEGRADED, DOWN, MAINTENANCE, UNKNOWN
nodeId format is org/platform/env/cluster (same 4-level path as the agent). Subsequent pushes update status only.
Script Example (systemd timer or cron)
On bare metal, load the API key from a root-only file, not from the script body — scripts in /usr/local/bin often end up in backups and shipped logs. chmod 600 /etc/itops/api-key.
#!/bin/bash
# /usr/local/bin/itops-health-push.sh
# Run every 30 s via a systemd timer (plain cron only fires once per minute)
API_URL="https://api.yourdomain.com"
API_KEY=$(cat /etc/itops/api-key) # root-only file, chmod 600
NODE_ID="myorg/infra/prod/dc1"
# Check MySQL/Galera cluster status
if mysqladmin ping --silent 2>/dev/null; then
  CLUSTER_SIZE=$(mysql -Nse "SHOW STATUS LIKE 'wsrep_cluster_size'" | awk '{print $2}')
  STATUS="OPERATIONAL"
  MSG="cluster_size=$CLUSTER_SIZE"
  # guard against an empty value if the node isn't running Galera
  [ "${CLUSTER_SIZE:-0}" -lt 3 ] && STATUS="DEGRADED" && MSG="cluster degraded: size=$CLUSTER_SIZE"
else
  STATUS="DOWN"
  MSG="MySQL not responding"
fi
curl -s -X POST "$API_URL/api/v1/health/report" \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d "{\"service\":\"galera-node1\",\"status\":\"$STATUS\",\"message\":\"$MSG\",\"nodeId\":\"$NODE_ID\"}"
Auto Incidents
When a service goes DOWN, ITOps automatically:
- Creates an SLA incident (source: MONITORING)
- Generates an INCIDENT ticket (if ticketing plugin active)
- Updates the SLA dashboard in real time
- Closes the incident when service recovers
Stale Detection (external / push services)
A background goroutine in itops-core runs every 60 s and flips EXTERNAL services (those created via push webhooks, not the K8s agent) to UNKNOWN after 2 minutes of silence, and to DOWN after 5 minutes. This guards against dead cronjobs silently hiding real outages — if the push source breaks, the UI makes it visible within a minute or two.
Agent-reported services (K8s workloads) don't use this timeout: they're driven by the agent's regular 30 s sync, so pausing the agent shows up directly as a missing heartbeat in the Operator Nodes view.
Multi-Cluster SLA Groups
SLA groups (slaGroup: "payment-system") are a cross-cluster logical concept. The sla_groups table is UNIQUE(name) — if two agents in different clusters, or an agent and a bare-metal push webhook, report the same slaGroup name, all of them join the same group row. Memberships from every source are merged, and the group card on the SLA Overview tab lists them together regardless of origin.
Both the agent sync loop and every push endpoint call the same EnsureSLAGroup / UpsertSLAGroupMember middleware — there is no separate "external" group path, so there's no way to accidentally create duplicate groups with the same name.
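For example, a bare-metal node and a service in another cluster can report the same group name; both land in the same group row. A sketch reusing the health endpoint above (payment-worker and cluster2 are hypothetical):

# two sources, same slaGroup -> one merged group on the SLA Overview tab
curl -s -X POST https://api.yourdomain.com/api/v1/health/report \
  -H "X-API-Key: $OPERATOR_API_KEY" -H "Content-Type: application/json" \
  -d '{"service":"galera-node1","status":"OPERATIONAL","nodeId":"myorg/infra/prod/baremetal","slaGroup":"payment-system"}'
curl -s -X POST https://api.yourdomain.com/api/v1/health/report \
  -H "X-API-Key: $OPERATOR_API_KEY" -H "Content-Type: application/json" \
  -d '{"service":"payment-worker","status":"OPERATIONAL","nodeId":"myorg/platform/prod/cluster2","slaGroup":"payment-system"}'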
Manual Report Trigger
Normally the daily SLA report is generated at 07:00 local time and, if ITOPS_SLA_PORTAL_URL is set, pushed to the SLA Portal. For development, QA, or a post-incident re-generation you can trigger it on demand:
curl -X POST https://api.yourdomain.com/api/v1/sla/report/generate \
-H "X-API-Key: $OPERATOR_API_KEY"
The endpoint returns the generated report's date and path. It's idempotent for the current day (overwrites the day's existing report).