Service Level Objectives (SLOs) and Service Level Indicators (SLIs)

This document defines the SLOs and SLIs for the dev-health-ops platform. These targets govern the reliability contract for the analytics API, data pipeline, and background workers.


1. API Availability

SLI

Proportion of HTTP requests that return a non-5xx response, measured over a rolling 30-day window.

SLI = (requests with status < 500) / (total requests)

Prometheus query:

1 - (
  sum(rate(http_requests_total{status=~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
)

SLO

Tier         Target   Error Budget (30d)
Production   99.5%    ~3.6 hours
Staging      95.0%    ~36 hours
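
The error budgets in the table follow directly from the target and the 30-day window. A minimal sketch of that arithmetic, assuming the budget is expressed as the time-equivalent of complete unavailability (the function name `error_budget_hours` is illustrative, not part of the platform):

```python
def error_budget_hours(target_pct: float, window_days: int = 30) -> float:
    """Hours of full unavailability allowed by an availability target."""
    window_hours = window_days * 24          # 30 days -> 720 hours
    return (100 - target_pct) / 100 * window_hours


for tier, target in [("Production", 99.5), ("Staging", 95.0)]:
    print(f"{tier}: ~{error_budget_hours(target):.1f} hours")
```

Because the SLI itself is request-based, the hour figures are a convenient approximation; the exact budget is a count of allowed failed requests.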

2. API Latency

SLI

Proportion of requests that complete within the latency threshold.

SLI (P95) = fraction of requests with duration < 2.0s
SLI (P99) = fraction of requests with duration < 5.0s
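
As an illustration of the definitions above, the fraction can be computed directly from a list of observed request durations. This is a sketch with made-up sample data; `fraction_within` is a hypothetical helper, not an existing function in the codebase:

```python
def fraction_within(durations_s: list[float], threshold_s: float) -> float:
    """Share of requests that finished strictly under the threshold."""
    if not durations_s:
        raise ValueError("no requests observed")
    fast = sum(1 for d in durations_s if d < threshold_s)
    return fast / len(durations_s)


# Illustrative durations in seconds (not real measurements).
sample = [0.12, 0.35, 0.48, 1.9, 2.4, 0.8, 5.5, 0.2, 0.3, 1.1]
print(fraction_within(sample, 2.0))  # 8 of 10 requests are under 2.0s
print(fraction_within(sample, 5.0))  # 9 of 10 requests are under 5.0s
```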

Prometheus query (P95):

histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
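
`histogram_quantile` does not see individual request durations; it estimates the quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The following is a simplified sketch of that idea, not Prometheus's actual implementation, and the bucket boundaries and counts are invented for the example:

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs and must include
    the +Inf bucket. Simplified sketch: assumes positive finite bounds.
    """
    ordered = sorted(buckets)
    if not ordered or not math.isinf(ordered[-1][0]):
        return math.nan
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(ordered) if count >= rank)
    upper, count = ordered[idx]
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound.
        return ordered[-2][0] if len(ordered) > 1 else math.nan
    lower, below = ordered[idx - 1] if idx > 0 else (0.0, 0.0)
    if count == below:
        return upper
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical cumulative counts for le = 0.5, 1, 2, 5, +Inf.
example = [(0.5, 600), (1.0, 800), (2.0, 940), (5.0, 990), (math.inf, 1000)]
print(histogram_quantile(0.95, example))
```

One consequence of the interpolation is that the estimate is only as precise as the bucket layout: a P95 that falls in a wide bucket (here 2s to 5s) can sit anywhere in that range, so bucket boundaries close to the SLO thresholds give more trustworthy readings.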

SLO

Percentile   Target Latency   SLO (% of window within target)
P50          < 500ms          99%
P95          < 2.0s           99%
P99          < 5.0s           95%

3. Analytics Data Freshness

SLI

Age of the latest record in ClickHouse for any active repository, that is, the time elapsed since the most recent successful metrics ingestion. The indicator is satisfied while that age stays below the freshness threshold.

SLI = age_of_latest_record < freshness_threshold

SLO

Metric                   Threshold   Target
Daily rollup freshness   < 26h       99.5%
Commit data freshness    < 4h        99%
Work item freshness      < 6h        99%

Alert: see alerts/rules.yml (planned DataStaleness alert group).
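
A freshness check against the thresholds in the table can be sketched as follows. The dictionary keys and the `is_fresh` helper are hypothetical names chosen for the example, and the timestamp of the latest record is assumed to have been fetched from ClickHouse separately:

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the SLO table; the keys are illustrative names.
FRESHNESS_THRESHOLDS = {
    "daily_rollup": timedelta(hours=26),
    "commit_data": timedelta(hours=4),
    "work_item": timedelta(hours=6),
}


def is_fresh(metric: str, latest_record_at: datetime, now: datetime) -> bool:
    """True when the newest record is younger than the metric's threshold."""
    age = now - latest_record_at
    return age < FRESHNESS_THRESHOLDS[metric]


now = datetime(2026, 2, 27, 12, 0, tzinfo=timezone.utc)
print(is_fresh("commit_data", now - timedelta(hours=3), now))   # within 4h
print(is_fresh("commit_data", now - timedelta(hours=5), now))   # older than 4h
```

Passing `now` explicitly keeps the check deterministic and easy to test; both timestamps should be timezone-aware so the subtraction is well defined.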


4. Celery Worker Reliability

SLI

Proportion of Celery task executions that complete successfully (not FAILURE or REVOKED), measured over a rolling 24-hour window.

SLI = devhealth_celery_tasks_total{state="success"} / devhealth_celery_tasks_total

Prometheus query:

sum(rate(devhealth_celery_tasks_total{state="success"}[24h]))
/
sum(rate(devhealth_celery_tasks_total[24h]))

SLO

Queue      Target Success Rate
metrics    99%
sync       99%
webhooks   99.5%
default    95%
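
Comparing observed task outcomes with the per-queue targets can be sketched as below. The counts are invented, and the `failure` and `revoked` state names and the per-queue breakdown are assumptions for the example; only `state="success"` appears in the metric shown above:

```python
# Targets from the table above, expressed as fractions.
QUEUE_TARGETS = {"metrics": 0.99, "sync": 0.99, "webhooks": 0.995, "default": 0.95}


def success_rate(counts_by_state: dict[str, int]) -> float:
    """Successful executions divided by all executions."""
    total = sum(counts_by_state.values())
    if total == 0:
        raise ValueError("no task executions observed")
    return counts_by_state.get("success", 0) / total


def meets_slo(queue: str, counts_by_state: dict[str, int]) -> bool:
    return success_rate(counts_by_state) >= QUEUE_TARGETS[queue]


# Hypothetical 24-hour task counts per queue.
observed = {
    "webhooks": {"success": 1990, "failure": 8, "revoked": 2},
    "default": {"success": 940, "failure": 60},
}
for queue, counts in observed.items():
    print(queue, f"{success_rate(counts):.3f}", meets_slo(queue, counts))
```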

5. LLM Categorisation Reliability

SLI

Proportion of LLM categorisation requests that return a valid result (not an error or repair-path fallback).

SLI = devhealth_llm_requests_total{status="success"} / devhealth_llm_requests_total

SLO

Target Success Rate   Latency P95
95%                   < 30s

6. ClickHouse Query Latency

SLI

P95 latency of analytical queries executed against ClickHouse.

Prometheus query:

histogram_quantile(
  0.95,
  sum(rate(devhealth_clickhouse_query_duration_seconds_bucket[5m])) by (le)
)

SLO

Percentile   Target
P50          < 500ms
P95          < 2.0s
P99          < 5.0s

Error Budget Policy

  • Error budget is calculated per 30-day rolling window.
  • When the remaining error budget drops below 50%, engineering is notified.
  • When the remaining error budget drops below 20%, all non-critical feature work is paused until reliability is restored.
  • Post-mortems are mandatory for any incident that consumes more than 10% of the 30-day error budget.
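
The policy thresholds above can be expressed as a small decision helper. This is a sketch under the assumption that the budget is tracked as a count of failed requests over the 30-day window; the function names and the sample numbers are illustrative only:

```python
def budget_remaining(target: float, good: int, total: int) -> float:
    """Fraction of the error budget left (1.0 untouched, below 0 overspent)."""
    allowed_bad = (1 - target) * total
    if allowed_bad == 0:
        raise ValueError("no error budget defined for this target and volume")
    return 1 - (total - good) / allowed_bad


def policy_action(remaining: float) -> str:
    """Map the remaining budget fraction to the action named in the policy."""
    if remaining < 0.20:
        return "pause non-critical feature work"
    if remaining < 0.50:
        return "notify engineering"
    return "no action"


def needs_postmortem(incident_bad: int, target: float, total: int) -> bool:
    """True when one incident consumed more than 10% of the window's budget."""
    return incident_bad > 0.10 * (1 - target) * total


# Hypothetical window: 1,000,000 requests, 3,000 of them failed, 99.5% target.
remaining = budget_remaining(0.995, good=997_000, total=1_000_000)
print(f"{remaining:.0%} of budget left ->", policy_action(remaining))
```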

Review Cadence

SLOs are reviewed quarterly by the platform team. Targets are adjusted based on observed reliability trends and business requirements.

Last updated: 2026-02-27 (CHAOS-677)