API Reference

GraphQL primary API + REST endpoints for agents and webhooks.

Overview

Endpoint | Protocol | Auth | Purpose
--- | --- | --- | ---
/health, /ready | GET | None | Health + readiness probes (returns version on /health)
/graphql | POST | JWT Bearer | Frontend API (queries, mutations)
/graphql/ws | WebSocket | JWT | Real-time subscriptions
/api/v1/operator/* | REST | X-API-Key or Bearer | Agent communication (status, heartbeat, register, services)
/api/v1/health/report | REST POST | X-API-Key or Bearer | External / bare-metal service health push (auto-creates service)
/api/v1/storage/report | REST POST | X-API-Key or Bearer | Storage metrics push (auto-creates service with storage tag)
/api/v1/backup/report | REST POST | X-API-Key or Bearer | Backup completion webhook (service / slaGroup / namespace)
/api/v1/sla/exclusion-window/start | REST POST | X-API-Key or Bearer | Start a maintenance window (pauses SLA timer)
/api/v1/sla/exclusion-window/stop | REST POST | X-API-Key or Bearer | Stop a maintenance window
/api/v1/sla/report/generate | REST POST | X-API-Key or Bearer | Manually trigger the daily SLA report
/api/v1/templates/export | GET | X-API-Key or Bearer | Export workflows / catalog / SLA definitions as YAML
/api/v1/templates/import | POST | X-API-Key or Bearer | Import workflows / catalog / SLA definitions from YAML
/api/v1/auth/providers | GET | None | Active auth providers (login page uses this)
/api/v1/license/activate | REST POST | X-API-Key or Bearer | License activation (authenticated since v4.1.2)

Authentication

Frontend (JWT)

Frontend tokens are signed with HMAC-SHA256 using ITOPS_JWT_SECRET (min 32 chars). By default a token expires after 1 hour; refresh it via the refreshToken returned by login.

# Login
POST /graphql
{
  "query": "mutation { login(email: \"user@example.com\", password: \"...\") { token refreshToken user { id email } } }"
}

# Use token
POST /graphql
Authorization: Bearer <jwt-token>
Content-Type: application/json
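
The two requests above can be assembled with Python's standard library. This is a minimal sketch rather than an official client: BASE_URL and the helper names build_login_request and build_authed_request are hypothetical, and the string arguments are escaped with json.dumps on the assumption that ordinary e-mail and password text is valid as a GraphQL string literal when JSON-escaped.

```python
import json
import urllib.request

BASE_URL = "https://itops.example.com"  # assumption: replace with your backend URL


def build_login_request(email: str, password: str) -> urllib.request.Request:
    """Build the POST /graphql request carrying the login mutation."""
    query = (
        "mutation { login(email: %s, password: %s) "
        "{ token refreshToken user { id email } } }"
        % (json.dumps(email), json.dumps(password))
    )
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/graphql",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def build_authed_request(token: str, query: str) -> urllib.request.Request:
    """Build a POST /graphql request that sends the JWT as a Bearer token."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/graphql",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
    )
```

Either request can be sent with urllib.request.urlopen(req); the token would then be read from the JSON response, presumably under data.login.token as in a standard GraphQL response envelope.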

License tokens are different: Ed25519-signed (not HMAC), issued by the ITOps license-gen CLI, validated once at backend startup. They go in ITOPS_LICENSE_KEY, not in the Authorization header.

Agent / webhook callers (API Key)

Both headers are accepted. Pick one — CronJobs and shell scripts usually prefer X-API-Key.

X-API-Key: <operator-api-key>
# or
Authorization: Bearer <operator-api-key>
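
A small helper keeps the "pick one" rule explicit in client code. This is only a sketch; operator_auth_headers is a hypothetical name, and rejecting an empty key is a client-side assumption rather than documented server behavior.

```python
def operator_auth_headers(api_key: str, use_bearer: bool = False) -> dict:
    """Return exactly one of the two accepted auth headers."""
    if not api_key:
        raise ValueError("operator API key must not be empty")
    if use_bearer:
        return {"Authorization": "Bearer " + api_key}
    # Default to X-API-Key, the form usually preferred by CronJobs and scripts.
    return {"X-API-Key": api_key}
```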

GraphQL API

Key Queries

# SLA Groups with backup status
{
  slaGroups {
    id name displayName tier status currentUptime
    services {
      serviceName status replicas readyReplicas
      backupStatus {
        backupExpected lastBackupAt lastBackupStatus
        backupMaxAgeDays backupOverdue
      }
    }
  }
}

# Real-time snapshot trend (5-min buckets)
{
  slaSnapshotTrend(serviceId: "uuid", hoursBack: 1) {
    periodKey periodStart actualValue targetValue
    incidentCount downtimeMinutes status
  }
}

# SLA trend data (daily/monthly)
{
  slaTrendData(filter: { periodType: "DAILY", monthsBack: 1 }) {
    periodKey periodStart actualValue status
  }
}

# Dashboard stats
{
  slaDashboardStats {
    totalServices servicesWithSla metCount
    atRiskCount breachedCount averageUptime
  }
}

Operator REST API

POST /api/v1/operator/status

Syncs service statuses from the agent. The agent calls it every 30 seconds.

{
  "nodeId": "myorg/platform/prod/cluster1",
  "operatorVersion": "1.0.0",
  "services": [
    {
      "name": "my-api",
      "status": "OPERATIONAL",
      "replicas": 3,
      "readyReplicas": 3,
      "slaGroup": "payment-system",
      "workloadType": "Deployment"
    }
  ],
  "slaGroups": [
    {
      "name": "payment-system",
      "displayName": "Payment System",
      "tier": "critical"
    }
  ]
}

POST /api/v1/operator/heartbeat

{
  "nodeId": "myorg/platform/prod/cluster1",
  "version": "1.0.0",
  "watchedServices": 7,
  "healthyServices": 6,
  "unhealthyServices": 1
}
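
In the example above the counters add up (7 watched = 6 healthy + 1 unhealthy). A hedged sketch of how an agent might derive them from the same services list it sends to /api/v1/operator/status follows. The function name build_heartbeat is hypothetical, and treating only OPERATIONAL as healthy is an assumption; the real agent may count differently.

```python
def build_heartbeat(node_id: str, version: str, services: list) -> dict:
    """Derive heartbeat counters from a list of service status dicts."""
    # Assumption: a service counts as healthy only when it is OPERATIONAL.
    healthy = sum(1 for s in services if s.get("status") == "OPERATIONAL")
    return {
        "nodeId": node_id,
        "version": version,
        "watchedServices": len(services),
        "healthyServices": healthy,
        "unhealthyServices": len(services) - healthy,
    }
```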

POST /api/v1/operator/register

Registers a new service discovered from the it-ops.yaml ConfigMap.

{
  "name": "my-database",
  "displayName": "My Database",
  "nodeId": "myorg/platform/prod/cluster1",
  "criticality": "critical",
  "operations": {
    "backup": {
      "expected": true,
      "maxAgeDays": 1
    }
  }
}

GET /api/v1/operator/services?nodeId=...

Returns expected services for the agent to monitor.
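
Because nodeId values contain slashes, it is safest to percent-encode the query string. The sketch below only builds the URL; operator_services_url is a hypothetical helper, and whether the backend also accepts unencoded slashes is not stated here.

```python
from urllib.parse import urlencode


def operator_services_url(base_url: str, node_id: str) -> str:
    """Build the GET URL for the expected-services lookup."""
    query = urlencode({"nodeId": node_id})  # encodes "/" as %2F
    return base_url.rstrip("/") + "/api/v1/operator/services?" + query
```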

Backup Webhook

POST /api/v1/backup/report

Three addressing modes. For service-level reports always include nodeId — missing nodeId falls back to "unknown" and the service lands under a red badge in the UI (detection over silence).

# Service-level (include nodeId for correct placement)
{
  "service": "my-database",
  "nodeId": "myorg/platform/prod/cluster1",
  "status": "success",        // success | failed | partial
  "sizeBytes": 5242880,
  "message": "pg_dump completed"
}

# SLA Group-level (propagates to every member with backup.expected=true)
{
  "slaGroup": "payment-system",
  "status": "success"
}

# Namespace-level
{
  "namespace": "production",
  "status": "success"
}

# Response
{
  "success": true,
  "message": "backup report recorded for 3 services",
  "affected": 3,
  "services": ["db-1", "db-2", "cache-1"]
}
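
The three addressing modes can be enforced on the client before the report is sent. This is a minimal sketch under two assumptions that are not stated above: that exactly one of service, slaGroup, or namespace should be set per report, and that refusing a service-level report without nodeId is preferable to letting it fall back to "unknown". The name build_backup_report is hypothetical.

```python
VALID_BACKUP_STATUSES = {"success", "failed", "partial"}


def build_backup_report(status, service=None, sla_group=None, namespace=None,
                        node_id=None, size_bytes=None, message=None) -> dict:
    """Build a /api/v1/backup/report body for one addressing mode."""
    if status not in VALID_BACKUP_STATUSES:
        raise ValueError("status must be success, failed, or partial")
    targets = {"service": service, "slaGroup": sla_group, "namespace": namespace}
    chosen = {key: value for key, value in targets.items() if value}
    if len(chosen) != 1:
        raise ValueError("set exactly one of service, sla_group, namespace")
    if service and not node_id:
        # Avoid the "unknown" node fallback described above.
        raise ValueError("service-level reports need node_id")
    payload = dict(chosen)
    payload["status"] = status
    if node_id:
        payload["nodeId"] = node_id
    if size_bytes is not None:
        payload["sizeBytes"] = size_bytes
    if message is not None:
        payload["message"] = message
    return payload
```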

Health Push (external / bare-metal)

POST /api/v1/health/report

Push health for services that aren't running in Kubernetes (VMs, physical hardware, external SaaS). First push auto-creates the service and its hierarchy node; subsequent pushes update status only.

{
  "service": "galera-node1",
  "nodeId": "myorg/infra/prod/baremetal",
  "status": "OPERATIONAL",    // OPERATIONAL | DEGRADED | DOWN | MAINTENANCE | UNKNOWN
  "message": "wsrep_cluster_size=3",
  "criticality": "critical",
  "slaGroup": "database-cluster",
  "serviceType": "database",
  "tags": ["database", "baremetal"]
}
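
A push script can validate the status value before sending, so a typo never reaches the backend. The sketch below is illustrative only: HealthStatus and build_health_report are hypothetical names, and passing the optional fields (criticality, slaGroup, serviceType, tags) through unchanged is an assumption about how a client would handle them.

```python
from enum import Enum


class HealthStatus(str, Enum):
    OPERATIONAL = "OPERATIONAL"
    DEGRADED = "DEGRADED"
    DOWN = "DOWN"
    MAINTENANCE = "MAINTENANCE"
    UNKNOWN = "UNKNOWN"


def build_health_report(service, node_id, status, message=None, **metadata) -> dict:
    """Build a /api/v1/health/report body; raises ValueError on a bad status."""
    payload = {
        "service": service,
        "nodeId": node_id,
        "status": HealthStatus(status).value,  # rejects values outside the enum
    }
    if message is not None:
        payload["message"] = message
    # Optional fields such as criticality, slaGroup, serviceType, tags.
    payload.update(metadata)
    return payload
```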

Storage Push

POST /api/v1/storage/report

Push disk / storage usage. Auto-creates the service with the storage tag on first push so it shows up on the Storage tab immediately.

{
  "service": "postgresql",
  "nodeId": "myorg/platform/prod/cluster1",
  "allocatedBytes": 107374182400,
  "usedBytes": 53687091200,
  "storageType": "pvc",       // disk | pvc | s3 | rds | efs | ...
  "mountPath": "/var/lib/postgresql"
}

# Response
{
  "success": true,
  "serviceName": "postgresql",
  "freePercent": 50,
  "status": "healthy"         // healthy (>30% free) | warning (10-30%) | critical (<10%)
}
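
The response values follow from the request: 53687091200 used out of 107374182400 allocated leaves 50% free, which is above the 30% threshold and therefore healthy. A sketch of that arithmetic is below. The function name storage_status is hypothetical, and the handling of the exact 30% and 10% boundaries (both treated as warning here) is an assumption, since the thresholds above do not say which side they fall on.

```python
def storage_status(allocated_bytes: int, used_bytes: int):
    """Return (free_percent, status) using the thresholds documented above."""
    if allocated_bytes <= 0:
        raise ValueError("allocated_bytes must be positive")
    if not 0 <= used_bytes <= allocated_bytes:
        raise ValueError("used_bytes must be between 0 and allocated_bytes")
    free_percent = (allocated_bytes - used_bytes) * 100 / allocated_bytes
    if free_percent > 30:
        status = "healthy"
    elif free_percent >= 10:
        status = "warning"   # assumption: 10% and 30% exactly count as warning
    else:
        status = "critical"
    return free_percent, status
```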

SLA Exclusion Windows (Maintenance)

# Start a maintenance window (pauses SLA calculation for the service)
POST /api/v1/sla/exclusion-window/start
{
  "service": "postgresql",
  "nodeId": "myorg/platform/prod/cluster1",
  "reason": "scheduled patching",
  "expectedEndAt": "2026-04-20T02:00:00Z"
}

# Stop the active window
POST /api/v1/sla/exclusion-window/stop
{
  "service": "postgresql",
  "nodeId": "myorg/platform/prod/cluster1"
}
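
Start and stop are naturally paired, so a maintenance script can wrap its work so that the window is always closed, even when patching fails. This is a sketch under assumptions: maintenance_window is a hypothetical helper, and post stands for any caller-supplied function that sends a JSON body to a path with the operator API key attached.

```python
from contextlib import contextmanager
from datetime import datetime, timedelta, timezone


@contextmanager
def maintenance_window(post, service, node_id, reason, duration: timedelta):
    """Open an SLA exclusion window, run the body, then always close it."""
    expected_end = datetime.now(timezone.utc) + duration
    post("/api/v1/sla/exclusion-window/start", {
        "service": service,
        "nodeId": node_id,
        "reason": reason,
        # Same UTC format as the example above, e.g. 2026-04-20T02:00:00Z
        "expectedEndAt": expected_end.strftime("%Y-%m-%dT%H:%M:%SZ"),
    })
    try:
        yield
    finally:
        # Runs on success and on exceptions, so the SLA timer resumes.
        post("/api/v1/sla/exclusion-window/stop", {
            "service": service,
            "nodeId": node_id,
        })
```

Typical use would be a with maintenance_window(post, "postgresql", node_id, "scheduled patching", timedelta(hours=2)): block around the patching commands.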

License API

POST /api/v1/license/activate

{
  "licenseKey": "eyJhbGciOiJFZERTQSIs..."
}

# Response
{
  "success": true,
  "message": "License activated",
  "customer": "My Company",
  "plugins": ["ticketing", "sla", "audit"]
}

Outbound HTTP safety (webhooks & workflow HTTP steps)

Every outbound HTTP call the backend makes on behalf of an admin (webhooks and workflow HTTP_REQUEST steps) goes through a shared SSRF validator. A webhook URL that hits any of the following returns an ExecFailed execution with a clear error in the webhook history UI (no socket is opened):

For legitimate in-cluster targets, use the host allowlist env var: ITOPS_SECURITY_WEBHOOK_HOST_ALLOWLIST=host1,host2,.... In dev environments the whole block can be lifted with ITOPS_SECURITY_ALLOW_PRIVATE_WEBHOOKS=true — don't do this in production.
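
To illustrate how the two environment variables might interact, here is a rough sketch of an allowlist-aware check. It is not the backend's validator: the blocked categories are not reproduced in this section, so the sketch assumes private, loopback, and link-local IP literals are what gets rejected, and the function names are hypothetical. A real validator would also resolve hostnames and check every resolved address.

```python
import ipaddress


def parse_allowlist(raw: str) -> set:
    """Parse a comma-separated ITOPS_SECURITY_WEBHOOK_HOST_ALLOWLIST value."""
    return {host.strip().lower() for host in raw.split(",") if host.strip()}


def is_blocked_target(host: str, allowlist: set, allow_private: bool = False) -> bool:
    """Return True when the host should be rejected before any socket opens."""
    if allow_private or host.lower() in allowlist:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. This sketch does not resolve DNS names.
        return False
    # Assumption: these are the address classes the validator rejects.
    return ip.is_private or ip.is_loopback or ip.is_link_local
```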

WebSocket Subscriptions

Connect to /graphql/ws for real-time updates via GraphQL subscriptions.

# Events available:
- ticket:created, ticket:updated, ticket:deleted
- sla:alert, sla:incident
- license:updated
- service:status_changed