PRD: TestOps

Linear Milestone: https://linear.app/fullchaos/project/dev-health-ops-f947bce19f4c/overview

Product Name

TestOps for Dev Health Metrics

Summary

TestOps extends Dev Health Metrics beyond source control and workflow analytics into build, test, and release execution. It ingests fine-grained CI/CD and test telemetry, normalizes it, and turns it into team-level and system-level signals about build health, test effectiveness, release risk, and engineering drag.

The goal is not just to show pipeline numbers. The goal is to connect delivery performance to test quality, defect risk, and developer health.

Current State (Existing Infrastructure)

The platform already has partial CI/CD and delivery analytics that TestOps will extend:

| Capability | Status | Key Files |
| --- | --- | --- |
| Pipeline run ingestion (status, duration, queue time) | Exists | `connectors/github.py`, `connectors/gitlab.py`, `storage/mixins/cicd.py` |
| ClickHouse tables: `ci_pipeline_runs`, `deployments`, `incidents` | Exists | `migrations/clickhouse/000_raw_tables.sql` |
| Pipeline metrics: success_rate, avg/p90 duration, queue minutes | Exists | `metrics/compute_cicd.py` |
| DORA metrics: deploy frequency, lead time, MTTR, change failure rate | Exists | `metrics/job_dora.py`, `migrations/clickhouse/023b_dora_metrics.sql` |
| Well-being metrics: after-hours ratio, weekend ratio | Exists | `metrics/compute_wellbeing.py` |
| Schemas: `PipelineRunRow`, `DeploymentRow`, `IncidentRow`, `DORAMetricsRecord` | Exists | `metrics/schemas.py` |
| Test execution models (suite, case, flakiness) | Does not exist | - |
| Code coverage ingestion and storage | Does not exist | - |
| Test-to-code ownership mapping | Does not exist | - |
| Flakiness detection and tracking | Does not exist | - |

TestOps extends this foundation. Epics should build on existing schemas and connectors, not replace them.

Problem

The current model covers delivery flow, code risk, collaboration, and cognitive load well, but it underweights the execution layer where engineering quality is either validated or exposed. Without TestOps:

teams can ship fast while silently accumulating flaky tests and unstable pipelines
leaders cannot distinguish slow delivery caused by review/process from slow delivery caused by broken CI
test coverage is often reported as a vanity metric rather than a risk metric
test failures, retry behavior, and flaky suites are not tied back to repos, teams, services, or changesets
there is no reliable way to quantify quality drag on developer throughput

Goals

Ingest CI/CD, test execution, and coverage data from major delivery systems
Quantify pipeline health, test reliability, and release risk at org, team, repo, service, and PR levels
Attribute build and test pain to teams, code areas, and change events
Expose actionable insights, not just dashboards
Integrate TestOps signals into Delivery, Durability, and Developer Well-being scores

Non-Goals

Replace CI/CD vendors or test runners
Act as a full test management system
Author or orchestrate test execution directly in v1
Provide root-cause analysis beyond supported correlations and heuristics

Target Users

Engineering leaders
Dev productivity / developer experience teams
QA / quality engineering leaders
Platform engineering teams
EMs and tech leads
Release managers

Core User Stories

As an engineering leader, I want to see which teams lose the most time to CI instability so I know where platform investment is justified.
As a dev productivity owner, I want to identify flaky tests by suite, owner, service, and impact so I can prioritize cleanup.
As a team lead, I want to understand whether release risk is driven by low coverage, unstable tests, or recent defect escape.
As a developer, I want to know whether my PR is likely to fail deployment based on historical signals.
As a QA lead, I want to see whether increased test volume is improving confidence or just increasing runtime.

Key Concepts

Pipeline Health: execution speed, reliability, retry burden, queueing, and failure modes
Test Reliability: pass consistency, flakiness, quarantine rate, rerun dependence
Coverage Quality: meaningful change coverage, not just aggregate line coverage
Release Confidence: probability a change can move through CI/CD without regressions
Quality Drag: time lost to failed pipelines, reruns, flaky tests, and blocked deploys

Data Sources

CI/CD Systems

GitHub Actions (v1)
GitLab CI (v1)
Jenkins (v1 — abstraction layer shared with Buildkite)
Buildkite (v1 — abstraction layer shared with Jenkins)
CircleCI (v2 — deferred from initial milestone)
Azure DevOps pipelines (v2 — deferred from initial milestone)

Test Frameworks and Artifacts

JUnit/XML
pytest
Jest
Playwright
Cypress
xUnit variants
coverage reports such as lcov, Cobertura, JaCoCo
Deployment systems where available
Existing repo, PR, issue, and code ownership data in Dev Health Metrics

Functional Requirements

1. CI/CD Ingestion

Ingest pipeline runs, jobs, stages, queue time, runtime, result, retry count, cancel reason, and trigger source
Associate pipeline runs to commit, branch, PR/MR, repo, service, and team
Support both polling and webhook ingestion patterns
Handle backfills and incremental sync

2. Test Execution Ingestion

Ingest test suite, test case, duration, status, retries, skipped/quarantined state, environment, and artifact links
Track failures over time at test case and suite level
Detect test ownership through repo/service/team mappings
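
A minimal sketch of the test-case ingestion step for JUnit-style XML, assuming reports follow the common `<testsuite>`/`<testcase>` layout with `<failure>`, `<error>`, and `<skipped>` child elements. `CaseResult` and `parse_junit_xml` are hypothetical names, not existing platform code. Retries, quarantine state, environment, and artifact links are left out.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical normalized row for one test case execution."""
    suite: str
    name: str
    status: str        # "passed", "failed", "error", or "skipped"
    duration_s: float


def parse_junit_xml(xml_text: str) -> list[CaseResult]:
    """Flatten a JUnit-style XML report into one row per test case."""
    root = ET.fromstring(xml_text)
    results = []
    # iter() matches <testsuite> whether it is the root or nested in <testsuites>.
    for suite in root.iter("testsuite"):
        for case in suite.findall("testcase"):
            status = "passed"
            for tag in ("failure", "error", "skipped"):
                if case.find(tag) is not None:
                    status = "failed" if tag == "failure" else tag
                    break
            results.append(CaseResult(
                suite=suite.get("name", ""),
                name=case.get("name", ""),
                status=status,
                # A missing or empty time attribute is treated as 0 seconds.
                duration_s=float(case.get("time") or 0),
            ))
    return results
```

Keeping the parser output framework-neutral would let pytest, Jest, Playwright, and Cypress reports share one downstream path whenever they emit JUnit-compatible XML.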

3. Coverage Ingestion

Ingest overall and changed-file coverage
Track delta coverage on modified files and touched code paths (changed-code coverage, deferred to Phase 2)
Support branch, PR, repo, and service views
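
As an illustration of per-file coverage ingestion, the sketch below reads an lcov tracefile using its `SF:`, `DA:`, and `end_of_record` records. The function names and the returned shape are assumptions for this example. Cobertura and JaCoCo would need their own parsers.

```python
def parse_lcov(text: str) -> dict[str, tuple[int, int]]:
    """Return {source_file: (lines_hit, lines_found)} from lcov tracefile text."""
    per_file: dict[str, tuple[int, int]] = {}
    current = None
    hit = found = 0
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("SF:"):
            # SF:<path> opens the record for one source file.
            current, hit, found = line[3:], 0, 0
        elif line.startswith("DA:") and current is not None:
            # DA:<line number>,<execution count>[,<checksum>]
            count = int(line[3:].split(",")[1])
            found += 1
            if count > 0:
                hit += 1
        elif line == "end_of_record" and current is not None:
            per_file[current] = (hit, found)
            current = None
    return per_file


def coverage_pct(hit: int, found: int) -> float:
    """Line coverage as a percentage; files with no instrumented lines report 0."""
    return 100.0 * hit / found if found else 0.0
```

Global coverage can then be derived by summing the hit and found counts across files rather than averaging per-file percentages, which would overweight small files.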

4. Metrics Engine

Compute at least the following:

Pipeline Metrics

Pipeline success rate
Pipeline failure rate
Median pipeline duration
P95 pipeline duration
Queue time
Rerun rate
Cancel rate
Deployment frequency
Change failure rate
MTTR for failed deploys
Failed deployment recovery time
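
A rough sketch of how several of the pipeline metrics above could be computed from normalized run records. The record fields (`status`, `duration_s`, `attempt`), the status values, and the choice of nearest-rank percentiles are assumptions for illustration. The existing logic in `metrics/compute_cicd.py` may differ.

```python
import statistics


def nearest_rank_percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids float rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def summarize_pipeline_runs(runs: list[dict]) -> dict:
    """Summarize runs shaped like {'status', 'duration_s', 'attempt'} (assumed fields)."""
    total = len(runs)
    if total == 0:
        return {}

    def share(predicate) -> float:
        return sum(1 for r in runs if predicate(r)) / total

    # Cancelled runs are excluded from duration stats so they do not skew them.
    durations = [r["duration_s"] for r in runs if r["status"] != "cancelled"]
    return {
        "success_rate": share(lambda r: r["status"] == "success"),
        "failure_rate": share(lambda r: r["status"] == "failure"),
        "cancel_rate": share(lambda r: r["status"] == "cancelled"),
        "rerun_rate": share(lambda r: r["attempt"] > 1),
        "median_duration_s": statistics.median(durations) if durations else None,
        "p95_duration_s": nearest_rank_percentile(durations, 95) if durations else None,
    }
```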

Test Metrics

Test pass rate
Test failure rate
Flake rate
Retry dependency rate
Test suite duration
Critical path test duration
Slowest suites/tests
Quarantined test count
Failure recurrence score
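
One simple way to approximate flake rate is to flag a test as flaky on a commit when the same test both passed and failed against that commit. The sketch below uses that heuristic. It is an assumption for illustration, not a settled definition, and it does not separate environment-dependent failures from true code flakes.

```python
from collections import defaultdict


def flake_rates(results: list[tuple[str, str, str]]) -> dict[str, float]:
    """results: (test_id, commit_sha, status) tuples.

    Returns, per test, the share of commits on which the test produced
    both a pass and a failure.
    """
    outcomes: dict[tuple[str, str], set[str]] = defaultdict(set)
    for test_id, sha, status in results:
        # Skipped or quarantined executions carry no pass/fail signal.
        if status in ("passed", "failed"):
            outcomes[(test_id, sha)].add(status)

    commits_seen: dict[str, int] = defaultdict(int)
    commits_flaky: dict[str, int] = defaultdict(int)
    for (test_id, _sha), statuses in outcomes.items():
        commits_seen[test_id] += 1
        if statuses == {"passed", "failed"}:
            commits_flaky[test_id] += 1
    return {t: commits_flaky[t] / commits_seen[t] for t in commits_seen}
```

A test that fails consistently on a commit scores 0 here, which keeps genuine regressions out of the flake count.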

Coverage Metrics

Global coverage
Changed-code coverage
Critical-path coverage
Coverage regression rate
Uncovered change count
Coverage-to-defect correlation

Derived Risk Metrics

Release confidence score
Test reliability index
Pipeline stability index
Quality drag hours
Escaped defect risk
Merge risk score
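
This PRD calls for a deterministic release confidence score with documented weights. A minimal sketch of such a composite follows. The signal names and weight values are placeholders invented for this example, not values from this PRD, and real inputs would first need to be normalized to the 0 to 1 range.

```python
# Hypothetical weights in percent; they sum to 100. Placeholder values only.
WEIGHTS = {
    "pipeline_stability": 40,
    "test_reliability": 35,
    "coverage": 25,
}


def release_confidence(signals: dict[str, float]) -> tuple[float, dict[str, float]]:
    """Return (score from 0 to 100, per-signal contribution) for signals in [0, 1]."""
    missing = sorted(set(WEIGHTS) - set(signals))
    if missing:
        # Refuse to score on partial data rather than silently assuming a value.
        raise ValueError(f"missing signals: {', '.join(missing)}")
    contributions = {
        name: round(weight * min(1.0, max(0.0, signals[name])), 2)
        for name, weight in WEIGHTS.items()
    }
    return round(sum(contributions.values()), 1), contributions
```

Returning the per-signal contributions alongside the score keeps the result explainable: a reader can see which input pulled the score down.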

5. Entity-Level Views

Allow all metrics to be sliced by:

- org
- business unit
- team
- repo
- service
- application
- branch
- PR/MR
- developer
- date range
- environment

6. Insights and Alerts

Surface insights such as:

- 40% of failed pipelines in Team A are caused by 12 flaky end-to-end tests
- Changed-code coverage dropped below threshold in the payments service for 3 consecutive weeks
- Build queue time increased 2.3x after runner capacity saturation
- Service X has the highest release risk due to low coverage plus high failure recurrence

7. Benchmarking

Compare teams against internal baselines
Compare current period vs prior period
Support maturity bands such as stable / watch / degraded / critical

UX / Reporting Requirements

- TestOps dashboard with tabs for Pipelines, Tests, Coverage, Release Risk, and Cost of Quality
- Drill-down from org to team to repo to failing suite to failing test
- Time-series views with weekly and monthly trends
- Correlation panels:
  - pipeline instability vs cycle time
  - flake rate vs PR lead time
  - changed-code coverage vs defect escape
- Heatmaps for flaky tests and unstable repos
- PR-level widget showing:
  - changed-code coverage
  - likely failing suites
  - release confidence
  - historical impact area risk

Scoring Model Integration

TestOps signals feed into the platform's analytical dimensions. These dimensions are currently implemented as:

- Backend-computed metrics in dedicated modules (`compute_cicd.py`, `compute_wellbeing.py`, `quality.py`, `compute_ic.py`)
- Frontend visualization zones in `quadrantZones.ts` that classify team operating state from raw metric inputs

TestOps adds new metric inputs to these existing computation and visualization layers:

- Delivery (backend: `compute_cicd.py`, `job_dora.py`; already partially computed)
  - queue time (exists; extend with TestOps granularity)
  - pipeline duration (exists; extend with per-stage breakdown)
  - deploy frequency (exists via DORA)
  - failed deployment recovery (exists via DORA MTTR)
- Durability (backend: `quality.py`; new TestOps signals needed)
  - coverage quality (new)
  - defect escape (new; correlation model)
  - test reliability (new)
  - release confidence (new; deterministic composite score)
- Developer Well-being (backend: `compute_wellbeing.py`; extend with TestOps burden signals)
  - rerun burden (new)
  - failed build interruption rate (new)
  - after-hours recovery/retry behavior (extend existing after-hours ratio)
- Dynamics (frontend: quadrant zone classification; new TestOps zone inputs)
  - team ownership of failures (new)
  - quality burden concentration (new)
  - cross-team dependency failures (new)

API Requirements

Example Objects

pipeline_run
pipeline_stage
job_run
test_case_result
test_suite_result
coverage_snapshot
release_risk_score

Required API Capabilities

fetch normalized metrics by dimension/entity/time window
fetch raw execution events for drill-down
fetch ranked insights and anomalies
fetch scorecard summaries for dashboards and reports

Success Metrics

% of active repos with CI/CD ingestion enabled
% of active repos with test result ingestion enabled
% of PRs with changed-code coverage
reduction in flaky test volume
reduction in rerun burden
reduction in p95 pipeline duration
improvement in release confidence
decrease in failed deployment recovery time

Risks

Coverage data quality is inconsistent across ecosystems
CI vendor schemas vary and can be noisy
Flake detection can create false positives if heuristics are simplistic
Teams may game test quantity rather than test quality
Attribution to team ownership can break in shared repos/services

Decisions (Resolved Open Questions)

| Question | Decision | Rationale |
| --- | --- | --- |
| Which CI/CD systems are highest priority for v1? | GitHub Actions + GitLab CI + Jenkins/Buildkite. CircleCI and Azure DevOps deferred to v2. | GitHub and GitLab cover the majority of users. Jenkins/Buildkite share an abstraction layer. |
| Is changed-code coverage mandatory for v1 or v2? | v2. Global and per-file coverage in v1; changed-code coverage computation deferred to Phase 2. | Changed-code coverage requires diff-aware analysis that adds scope risk to the initial milestone. |
| How should we model monorepo ownership cleanly? | Use existing CODEOWNERS / team-to-path mappings from the entity resolution contract (Epic 1). Service boundaries inferred from path prefixes when CODEOWNERS is unavailable. | Avoids new ownership primitives; leverages existing team/repo/service entity mapping. |
| Should release confidence be deterministic, heuristic, or ML-assisted in v1? | Deterministic with documented weights. Heuristic or ML-assisted scoring deferred to Phase 3. | Deterministic models are explainable, auditable, and testable. Prerequisite for future ML. |
| Do we expose cost metrics (runner spend, wasted compute) in v1? | No. Deferred to Phase 3 (cost-of-quality analytics). | Requires CI vendor billing API integration not yet scoped. |

Open Questions (Remaining)

How should flake detection handle environment-dependent failures vs true code flakes?
Should test ownership attribution prefer CODEOWNERS, directory convention, or historical authorship?
What is the minimum artifact retention policy for raw test results before rollup-only retention?

Phased Delivery

Phase 1

CI ingestion
basic test result ingestion
pipeline and flake metrics
dashboard and team views

Phase 2

changed-code coverage
release confidence
risk scoring
alerting and anomaly detection

Phase 3

predictive failure models
recommended remediation
cost-of-quality analytics
PR-level quality guidance

PRD: AI Generative Reports

Product Name

AI Generative Reports for Dev Health Metrics

Summary

AI Generative Reports let users ask natural-language questions and receive generated reports, narrative analysis, live charts, and reusable reporting workflows based on their Dev Health data.

This is not just a chatbot on top of dashboards. It is a report-generation layer that composes metrics, trends, comparisons, explanations, and visuals from the underlying analytics engine.

Example:

Create a weekly report that shows cycle time, review bottlenecks, flaky test growth, and after-hours work for the platform team, with live charts and a summary of what changed.

Problem

The product can accumulate excellent metrics and still fail if users must manually assemble dashboards every week. Current analytics products often break at the last mile:

leaders need answers, not panels
teams need recurring summaries, not dashboard archaeology
chart creation is too manual
natural-language questions are not mapped cleanly to the metric model
generated insights often hallucinate or ignore actual data boundaries

Goals

Let users generate trustworthy, data-grounded reports from natural language
Automatically create charts, summaries, comparisons, and anomalies from live metrics
Support reusable scheduled reports without requiring dashboard authoring
Preserve strict traceability from narrative claims back to computed metrics
Reduce time-to-insight for leaders and managers

Non-Goals

Open-ended general chat unrelated to Dev Health data
Unbounded autonomous analysis across systems without guardrails
Replacing analysts for bespoke deep-dive work in v1
Generating unsupported conclusions without metric provenance

Target Users

Engineering executives
VPs and directors
EMs
Dev productivity teams
Program/release leads
Team leads
Individual developers who want personal or team summaries

Core User Stories

As a VP, I want a weekly engineering health summary with charts and narrative so I can review trends in minutes.
As an EM, I want to ask why cycle time worsened this week and get a grounded explanation with supporting metrics.
As a platform lead, I want a reusable report for CI instability and flaky tests by team every Monday morning.
As a team lead, I want a monthly health review covering throughput, durability, collaboration, and burnout indicators.
As a developer, I want to ask which code areas and PR patterns are creating the most rework in my team.

Product Principles

Grounded first: every narrative claim must map to actual metric outputs
Structured generation: LLM composes from validated metric payloads, not raw guesswork
Visual by default: charts and tables are first-class outputs
Reusable: any ad hoc prompt can become a saved report template
Explainable: generated output must show what data, time range, and filters were used

Supported Output Types

Executive summary
Weekly team report
Monthly business review
Incident/release health review
Delivery risk report
Quality trend report
Developer well-being summary
Custom ad hoc analysis
Slide-ready export
Markdown export
JSON/report spec export

Functional Requirements

1. Natural Language Query Interface

Users can ask prompts such as:

- Create a weekly report for the platform team
- Why did review time spike in March?
- Show me teams with worsening build stability and burnout risk
- Generate a monthly report on delivery vs durability for all backend teams

The system must:

- parse entities, metrics, time windows, comparisons, and grouping instructions
- resolve ambiguous prompt elements to known metric model concepts where possible
- reject unsupported requests cleanly instead of inventing data

2. Report Planner

Convert user intent into a structured report plan:

- audience
- time range
- scope
- metrics requested
- comparison periods
- narrative sections
- required charts
- required insights
- confidence/provenance requirements
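
A minimal sketch of a structured report plan, with a builder that rejects unsupported metrics instead of passing them on to generation. `ReportPlan`, `build_plan`, `UnsupportedRequest`, and the metric catalog are hypothetical names and values for illustration. Only a subset of the plan fields is modeled.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical catalog; in practice this would come from the metric model.
SUPPORTED_METRICS = frozenset({"cycle_time", "flaky_test_rate", "after_hours_ratio"})


class UnsupportedRequest(ValueError):
    """Raised when a prompt asks for something the metric model cannot answer."""


@dataclass(frozen=True)
class ReportPlan:
    audience: str
    start: date
    end: date
    teams: tuple[str, ...]
    metrics: tuple[str, ...]
    sections: tuple[str, ...] = ("summary",)


def build_plan(audience: str, start: date, end: date,
               teams: list[str], metrics: list[str]) -> ReportPlan:
    """Validate parsed prompt elements and freeze them into a plan."""
    unknown = sorted(set(metrics) - SUPPORTED_METRICS)
    if unknown:
        raise UnsupportedRequest(f"unsupported metrics: {', '.join(unknown)}")
    if start > end:
        raise UnsupportedRequest("time range start is after end")
    return ReportPlan(audience, start, end, tuple(teams), tuple(metrics))
```

Freezing the plan before any retrieval or generation runs gives the scope preview and the provenance panel a single object to display.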

3. Metrics Retrieval Layer

fetch only validated metrics from the core analytics system
support aggregation, ranking, slicing, filtering, and trend analysis
support cross-domain joins such as cycle time + flake rate + after-hours work

4. Chart Generator

Support auto-generation of:

- line charts
- bar charts
- stacked composition charts
- heatmaps
- ranking tables
- trend deltas
- scorecards

Charts must be:

- live against current data
- configurable by team, repo, service, and time range
- embeddable into saved reports

5. Narrative Generation

The system generates:

- key findings
- trend summaries
- anomaly descriptions
- comparison narratives
- risk callouts
- recommended next questions

Narrative must be constrained by:

- available metrics only
- explicit time window
- known entity scope
- thresholded confidence rules

6. Saved Reports

Users can:

- save a generated report definition
- rerun it on demand
- parameterize team, repo, and date ranges
- clone and modify templates

7. Scheduled Reports

Users can schedule recurring reports:

- weekly
- monthly
- post-release
- post-incident
- end-of-sprint

Delivery targets can include:

- in-app dashboard/report center
- markdown export
- email or Slack delivery in later phases

8. Recommended Reports

System suggests report templates such as:

- Weekly Engineering Health
- Team Delivery and Quality Review
- CI Stability and Test Reliability
- Burnout Risk and Flow State Trends
- Release Readiness Overview

Example Report Structure

Weekly Engineering Health Report

Executive summary
Delivery trends
Quality and TestOps trends
Collaboration and review health
Well-being signals
Risks and anomalies
Recommended actions

UX Requirements

- Prompt box with examples
- Preview of parsed report scope before execution
- Editable report outline
- Live chart rendering in the report body
- Ability to pin or remove sections
- Provenance panel showing:
  - data sources used
  - metrics used
  - time range
  - filters applied
- One-click save as template
- One-click export to markdown

Trust and Guardrails

This feature will fail if it becomes a hallucination engine. Requirements:

- narrative claims must reference computed metrics
- unsupported metrics must be omitted or explicitly marked unsupported
- every report must display time window and filter scope
- every generated insight should have a confidence state:
  - direct metric fact
  - inferred from correlated signals
  - hypothesis needing further validation
- no freeform recommendations without supporting evidence
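
The guardrails above could be enforced by a post-generation filter along these lines. This is a sketch under the assumption that each narrative claim carries the IDs of the metrics it cites. `Claim`, `Confidence`, and `partition_claims` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    DIRECT = "direct metric fact"
    INFERRED = "inferred from correlated signals"
    HYPOTHESIS = "hypothesis needing further validation"


@dataclass(frozen=True)
class Claim:
    text: str
    metric_ids: tuple[str, ...]
    confidence: Confidence


def partition_claims(claims: list[Claim],
                     computed_metric_ids: set[str]) -> tuple[list[Claim], list[Claim]]:
    """Split claims into (grounded, rejected).

    A claim is grounded only if it cites at least one metric and every
    cited metric was actually computed for this report.
    """
    grounded, rejected = [], []
    for claim in claims:
        cited = set(claim.metric_ids)
        ok = bool(cited) and cited <= computed_metric_ids
        (grounded if ok else rejected).append(claim)
    return grounded, rejected
```

Rejected claims could either be dropped or surfaced as explicitly unsupported, depending on how the report handles omitted content.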

Report DSL / Internal Spec

Every generated report should compile to an internal structured format, for example:

```yaml
report_type: weekly_engineering_health
scope:
  teams: ["platform"]
time_range:
  start: 2026-04-01
  end: 2026-04-07
sections:
  - summary
  - delivery
  - quality
  - testops
  - wellbeing
charts:
  - metric: cycle_time
    type: line
    group_by: week
  - metric: flaky_test_rate
    type: bar
    group_by: repo
insights:
  - trend_deltas
  - anomalies
  - top_risks