Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
This document defines the SLOs and SLIs for the dev-health-ops platform. These targets govern the reliability contract for the analytics API, data pipeline, and background workers.
1. API Availability
SLI
Proportion of HTTP requests that return a non-5xx response, measured over a rolling 30-day window.
SLI = (requests with status < 500) / (total requests)
Prometheus query:
1 - (
sum(rate(http_requests_total{status=~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
)
SLO
| Tier | Target | Error Budget (30d) |
|---|---|---|
| Production | 99.5% | ~3.6 hours |
| Staging | 95.0% | ~36 hours |
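The error budget column follows directly from the target and the window length. The sketch below reproduces that arithmetic. It expresses the budget as equivalent full-outage time, which assumes traffic is spread evenly across the window. The function name `error_budget` is illustrative and not part of the platform.

```python
from datetime import timedelta

# Rolling window used by the availability SLO in this document.
WINDOW = timedelta(days=30)


def error_budget(target: float, window: timedelta = WINDOW) -> timedelta:
    """Return the time-equivalent error budget for an availability target.

    `target` is a fraction (0.995 for 99.5%). The result is the amount of
    full-outage time that would exactly exhaust the budget, assuming
    uniform traffic over the window.
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    return window * (1.0 - target)


def as_hours(duration: timedelta) -> float:
    """Convert a timedelta to fractional hours."""
    return duration.total_seconds() / 3600.0
```

With these definitions, `as_hours(error_budget(0.995))` comes to about 3.6 hours and `as_hours(error_budget(0.95))` to about 36 hours, matching the table.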
2. API Latency
SLI
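Proportion of evaluation windows in which the measured latency percentile stays below its threshold.
SLI (P50) = fraction of windows in which P50 request duration < 500ms
SLI (P95) = fraction of windows in which P95 request duration < 2.0s
SLI (P99) = fraction of windows in which P99 request duration < 5.0s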
Prometheus query (P95):
histogram_quantile(
0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
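To make the query's result easier to interpret, here is a simplified sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the approach Prometheus documents for classic histograms. It is not the Prometheus implementation and skips several of its edge cases. The bucket boundaries and counts in the usage note are invented for illustration.

```python
import math
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with the last upper bound being +Inf. Observations are
    assumed to be spread uniformly inside each bucket.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total  # number of observations at or below the quantile
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite upper bound rather than infinity.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return prev_bound
```

For the hypothetical buckets `[(0.5, 50), (1.0, 80), (2.0, 90), (5.0, 98), (math.inf, 100)]`, `estimate_quantile(0.95, ...)` gives 3.875 seconds, which would breach a 2.0s P95 target. The estimate is only as precise as the bucket layout, so bucket boundaries near the SLO thresholds matter.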
SLO
| Percentile | Target Latency | SLO (% of window within target) |
|---|---|---|
| P50 | < 500ms | 99% |
| P95 | < 2.0s | 99% |
| P99 | < 5.0s | 95% |
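One way to read the last column is as the share of evaluation windows in which the percentile stayed under its target. The sketch below computes that share from a series of per-window percentile values. That reading, the sample values, and the name `window_compliance` are assumptions made for illustration.

```python
from typing import Sequence


def window_compliance(quantile_samples: Sequence[float], target_seconds: float) -> float:
    """Return the fraction of windows whose latency percentile met the target.

    `quantile_samples` holds one percentile value (in seconds) per
    evaluation window, for example one P95 reading every five minutes.
    """
    if not quantile_samples:
        raise ValueError("at least one sample is required")
    within = sum(1 for value in quantile_samples if value < target_seconds)
    return within / len(quantile_samples)


def meets_slo(quantile_samples: Sequence[float], target_seconds: float, slo: float) -> bool:
    """Check the compliance fraction against the SLO (0.99 for 99%)."""
    return window_compliance(quantile_samples, target_seconds) >= slo
```

For example, P95 readings of `[1.2, 1.8, 2.4, 1.1]` against the 2.0s target give a compliance of 0.75, which falls short of a 99% SLO.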
3. Analytics Data Freshness
SLI
Time elapsed since the most recent successful metrics ingestion for any active repository, measured as the age of the latest record in ClickHouse.
SLI = age_of_latest_record < freshness_threshold
SLO
| Metric | Threshold | Target |
|---|---|---|
| Daily rollup freshness | < 26h | 99.5% |
| Commit data freshness | < 4h | 99% |
| Work item freshness | < 6h | 99% |
Alert: see alerts/rules.yml (planned DataStaleness alert group).
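The freshness check can be sketched as a comparison between the age of the latest record and the threshold for its metric. The thresholds below come from the table. The dictionary keys and the function name `is_fresh` are hypothetical, and fetching the latest timestamp from ClickHouse is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Thresholds from the SLO table; the key names are illustrative.
FRESHNESS_THRESHOLDS = {
    "daily_rollup": timedelta(hours=26),
    "commit_data": timedelta(hours=4),
    "work_item": timedelta(hours=6),
}


def is_fresh(metric: str, latest_record_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True when the latest record is younger than the metric's threshold.

    `latest_record_at` must be timezone-aware. `now` can be injected for
    testing and defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    age = now - latest_record_at
    return age < FRESHNESS_THRESHOLDS[metric]
```

Running this check on a schedule and recording the share of passing checks yields a value that can be compared against the Target column.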
4. Celery Worker Reliability
SLI
Proportion of Celery task executions that complete successfully (not FAILURE or REVOKED), measured over 24 hours.
SLI = devhealth_celery_tasks_total{state="success"} / devhealth_celery_tasks_total
Prometheus query:
sum(rate(devhealth_celery_tasks_total{state="success"}[24h]))
/
sum(rate(devhealth_celery_tasks_total[24h]))
SLO
| Queue | Target Success Rate |
|---|---|
| metrics | 99% |
| sync | 99% |
| webhooks | 99.5% |
| default | 95% |
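Because the targets differ by queue, the success rate has to be evaluated per queue rather than in aggregate. The sketch below compares per-queue counts against the targets in the table. It assumes the counts are already broken down by queue, for example through a queue label on the metric, which this document does not confirm. The name `queue_slo_report` is illustrative.

```python
from typing import Dict, Optional, Tuple

# Targets from the SLO table, expressed as fractions.
QUEUE_TARGETS = {
    "metrics": 0.99,
    "sync": 0.99,
    "webhooks": 0.995,
    "default": 0.95,
}


def queue_slo_report(
    counts: Dict[str, Tuple[int, int]]
) -> Dict[str, Tuple[Optional[float], bool]]:
    """Map each queue to (success_rate, target_met).

    `counts` maps a queue name to (successful_tasks, total_tasks) over the
    24-hour window. A queue with no tasks has no rate and is treated as
    meeting its target.
    """
    report = {}
    for queue, (succeeded, total) in counts.items():
        target = QUEUE_TARGETS[queue]
        rate = succeeded / total if total else None
        report[queue] = (rate, rate is None or rate >= target)
    return report
```

With 994 successes out of 1000 tasks, the webhooks queue misses its 99.5% target, while 960 out of 1000 on the default queue meets 95%.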
5. LLM Categorisation Reliability
SLI
Proportion of LLM categorisation requests that return a valid result (not an error or repair-path fallback).
SLI = devhealth_llm_requests_total{status="success"} / devhealth_llm_requests_total
SLO
| Target Success Rate | Latency P95 |
|---|---|
| 95% | < 30s |
6. ClickHouse Query Latency
SLI
P95 latency of analytical queries executed against ClickHouse.
Prometheus query:
histogram_quantile(
0.95,
sum(rate(devhealth_clickhouse_query_duration_seconds_bucket[5m])) by (le)
)
SLO
| Percentile | Target |
|---|---|
| P50 | < 500ms |
| P95 | < 2.0s |
| P99 | < 5.0s |
Error Budget Policy
- Error budget is calculated per rolling 30-day window.
- When the remaining error budget drops below 50%, engineering is notified.
- When the remaining error budget drops below 20%, all non-critical feature work is paused until reliability is restored.
- Post-mortems are mandatory for any incident that consumes more than 10% of the 30-day error budget.
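The policy thresholds can be sketched as a small decision function. The code assumes the percentages refer to the share of the error budget still remaining in the window, and the action names are placeholders rather than identifiers from the platform.

```python
def budget_remaining(sli: float, target: float) -> float:
    """Return the fraction of the error budget still unspent.

    `sli` is the observed good-event ratio over the window and `target`
    is the SLO, both as fractions. A negative result means the budget is
    overspent.
    """
    allowed_error = 1.0 - target
    if allowed_error <= 0.0:
        raise ValueError("target must be below 1.0 to leave an error budget")
    consumed = (1.0 - sli) / allowed_error
    return 1.0 - consumed


def policy_action(remaining: float) -> str:
    """Map the remaining budget fraction to the policy step it triggers."""
    if remaining < 0.20:
        return "pause_non_critical_feature_work"
    if remaining < 0.50:
        return "notify_engineering"
    return "none"


def postmortem_required(incident_budget_fraction: float) -> bool:
    """An incident needs a post-mortem when it uses more than 10% of the budget."""
    return incident_budget_fraction > 0.10
```

For a 99.5% target, an observed SLI of 99.7% leaves roughly 40% of the budget and triggers a notification, while 99.55% leaves roughly 10% and pauses non-critical feature work.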
Review Cadence
SLOs are reviewed quarterly by the platform team. Targets are adjusted based on observed reliability trends and business requirements.
Last updated: 2026-02-27 (CHAOS-677)