PRD: TestOps¶
Linear Milestone: https://linear.app/fullchaos/project/dev-health-ops-f947bce19f4c/overview
Product Name¶
TestOps for Dev Health Metrics
Summary¶
TestOps extends Dev Health Metrics beyond source control and workflow analytics into build, test, and release execution. It ingests fine-grained CI/CD and test telemetry, normalizes it, and turns it into team-level and system-level signals about build health, test effectiveness, release risk, and engineering drag.
The goal is not just to show pipeline numbers. The goal is to connect delivery performance to test quality, defect risk, and developer health.
Current State (Existing Infrastructure)¶
The platform already has partial CI/CD and delivery analytics that TestOps will extend:
| Capability | Status | Key Files |
|---|---|---|
| Pipeline run ingestion (status, duration, queue time) | Exists | connectors/github.py, connectors/gitlab.py, storage/mixins/cicd.py |
ClickHouse tables: ci_pipeline_runs, deployments, incidents |
Exists | migrations/clickhouse/000_raw_tables.sql |
| Pipeline metrics: success_rate, avg/p90 duration, queue minutes | Exists | metrics/compute_cicd.py |
| DORA metrics: deploy frequency, lead time, MTTR, change failure rate | Exists | metrics/job_dora.py, migrations/clickhouse/023b_dora_metrics.sql |
| Well-being metrics: after-hours ratio, weekend ratio | Exists | metrics/compute_wellbeing.py |
Schemas: PipelineRunRow, DeploymentRow, IncidentRow, DORAMetricsRecord |
Exists | metrics/schemas.py |
| Test execution models (suite, case, flakiness) | Does not exist | — |
| Code coverage ingestion and storage | Does not exist | — |
| Test-to-code ownership mapping | Does not exist | — |
| Flakiness detection and tracking | Does not exist | — |
TestOps extends this foundation. Epics should build on existing schemas and connectors, not replace them.
Problem¶
The current model covers delivery flow, code risk, collaboration, and cognitive load well, but it underweights the execution layer where engineering quality is either validated or exposed. Without TestOps:
- teams can ship fast while silently accumulating flaky tests and unstable pipelines
- leaders cannot distinguish slow delivery caused by review/process from slow delivery caused by broken CI
- test coverage is often reported as a vanity metric rather than a risk metric
- test failures, retry behavior, and flaky suites are not tied back to repos, teams, services, or changesets
- there is no reliable way to quantify quality drag on developer throughput
Goals¶
- Ingest CI/CD, test execution, and coverage data from major delivery systems
- Quantify pipeline health, test reliability, and release risk at org, team, repo, service, and PR levels
- Attribute build and test pain to teams, code areas, and change events
- Expose actionable insights, not just dashboards
- Integrate TestOps signals into Delivery, Durability, and Developer Well-being scores
Non-Goals¶
- Replace CI/CD vendors or test runners
- Act as a full test management system
- Author or orchestrate test execution directly in v1
- Provide root-cause analysis beyond supported correlations and heuristics
Target Users¶
- Engineering leaders
- Dev productivity / developer experience teams
- QA / quality engineering leaders
- Platform engineering teams
- EMs and tech leads
- Release managers
Core User Stories¶
- As an engineering leader, I want to see which teams lose the most time to CI instability so I know where platform investment is justified.
- As a dev productivity owner, I want to identify flaky tests by suite, owner, service, and impact so I can prioritize cleanup.
- As a team lead, I want to understand whether release risk is driven by low coverage, unstable tests, or recent defect escape.
- As a developer, I want to know whether my PR is likely to fail deployment based on historical signals.
- As a QA lead, I want to see whether increased test volume is improving confidence or just increasing runtime.
Key Concepts¶
- Pipeline Health: execution speed, reliability, retry burden, queueing, and failure modes
- Test Reliability: pass consistency, flakiness, quarantine rate, rerun dependence
- Coverage Quality: meaningful change coverage, not just aggregate line coverage
- Release Confidence: probability a change can move through CI/CD without regressions
- Quality Drag: time lost to failed pipelines, reruns, flaky tests, and blocked deploys
Data Sources¶
CI/CD Systems¶
- GitHub Actions (v1)
- GitLab CI (v1)
- Jenkins (v1 — abstraction layer shared with Buildkite)
- Buildkite (v1 — abstraction layer shared with Jenkins)
- CircleCI (v2 — deferred from initial milestone)
- Azure DevOps pipelines (v2 — deferred from initial milestone)
Test Frameworks and Artifacts¶
- JUnit/XML
- pytest
- Jest
- Playwright
- Cypress
- xUnit variants
- coverage reports such as lcov, Cobertura, JaCoCo
- Deployment systems where available
- Existing repo, PR, issue, and code ownership data in Dev Health Metrics
Functional Requirements¶
1. CI/CD Ingestion¶
- Ingest pipeline runs, jobs, stages, queue time, runtime, result, retry count, cancel reason, and trigger source
- Associate pipeline runs to commit, branch, PR/MR, repo, service, and team
- Support both polling and webhook ingestion patterns
- Handle backfills and incremental sync
2. Test Execution Ingestion¶
- Ingest test suite, test case, duration, status, retries, skipped/quarantined state, environment, and artifact links
- Track failures over time at test case and suite level
- Detect test ownership through repo/service/team mappings
3. Coverage Ingestion¶
- Ingest overall and changed-file coverage
- Track delta coverage on modified files and touched code paths
- Support branch, PR, repo, and service views
4. Metrics Engine¶
Compute at least the following:
Pipeline Metrics¶
- Pipeline success rate
- Pipeline failure rate
- Median pipeline duration
- P95 pipeline duration
- Queue time
- Rerun rate
- Cancel rate
- Deployment frequency
- Change failure rate
- MTTR for failed deploys
- Failed deployment recovery time
Test Metrics¶
- Test pass rate
- Test failure rate
- Flake rate
- Retry dependency rate
- Test suite duration
- Critical path test duration
- Slowest suites/tests
- Quarantined test count
- Failure recurrence score
Coverage Metrics¶
- Global coverage
- Changed-code coverage
- Critical-path coverage
- Coverage regression rate
- Uncovered change count
- Coverage-to-defect correlation
Derived Risk Metrics¶
- Release confidence score
- Test reliability index
- Pipeline stability index
- Quality drag hours
- Escaped defect risk
- Merge risk score
5. Entity-Level Views¶
Allow all metrics to be sliced by: - org - business unit - team - repo - service - application - branch - PR/MR - developer - date range - environment
6. Insights and Alerts¶
Surface insights such as: - 40% of failed pipelines in Team A are caused by 12 flaky end-to-end tests - Changed-code coverage dropped below threshold in the payments service for 3 consecutive weeks - Build queue time increased 2.3x after runner capacity saturation - Service X has the highest release risk due to low coverage plus high failure recurrence
7. Benchmarking¶
- Compare teams against internal baselines
- Compare current period vs prior period
- Support maturity bands such as stable / watch / degraded / critical
UX / Reporting Requirements¶
- TestOps dashboard with tabs for Pipelines, Tests, Coverage, Release Risk, and Cost of Quality
- Drill-down from org to team to repo to failing suite to failing test
- Time-series views with weekly and monthly trends
- Correlation panels:
- pipeline instability vs cycle time
- flake rate vs PR lead time
- changed-code coverage vs defect escape
- Heatmaps for flaky tests and unstable repos
- PR-level widget showing:
- changed-code coverage
- likely failing suites
- release confidence
- historical impact area risk
Scoring Model Integration¶
TestOps signals feed into the platform's analytical dimensions. These dimensions are currently implemented as:
- Backend-computed metrics in dedicated modules (compute_cicd.py, compute_wellbeing.py, quality.py, compute_ic.py)
- Frontend visualization zones in quadrantZones.ts that classify team operating state from raw metric inputs
TestOps adds new metric inputs to these existing computation and visualization layers:
- Delivery (backend:
compute_cicd.py,job_dora.py— already partially computed) - queue time (exists — extend with TestOps granularity)
- pipeline duration (exists — extend with per-stage breakdown)
- deploy frequency (exists via DORA)
- failed deployment recovery (exists via DORA MTTR)
- Durability (backend:
quality.py— new TestOps signals needed) - coverage quality (new)
- defect escape (new — correlation model)
- test reliability (new)
- release confidence (new — deterministic composite score)
- Developer Well-being (backend:
compute_wellbeing.py— extend with TestOps burden signals) - rerun burden (new)
- failed build interruption rate (new)
- after-hours recovery/retry behavior (extend existing after-hours ratio)
- Dynamics (frontend: quadrant zone classification — new TestOps zone inputs)
- team ownership of failures (new)
- quality burden concentration (new)
- cross-team dependency failures (new)
API Requirements¶
Example Objects¶
- pipeline_run
- pipeline_stage
- job_run
- test_case_result
- test_suite_result
- coverage_snapshot
- release_risk_score
Required API Capabilities¶
- fetch normalized metrics by dimension/entity/time window
- fetch raw execution events for drill-down
- fetch ranked insights and anomalies
- fetch scorecard summaries for dashboards and reports
Success Metrics¶
- % of active repos with CI/CD ingestion enabled
- % of active repos with test result ingestion enabled
- % of PRs with changed-code coverage
- reduction in flaky test volume
- reduction in rerun burden
- reduction in p95 pipeline duration
- improvement in release confidence
- decrease in failed deployment recovery time
Risks¶
- Coverage data quality is inconsistent across ecosystems
- CI vendor schemas vary and can be noisy
- Flake detection can create false positives if heuristics are simplistic
- Teams may game test quantity rather than test quality
- Attribution to team ownership can break in shared repos/services
Decisions (Resolved Open Questions)¶
| Question | Decision | Rationale |
|---|---|---|
| Which CI/CD systems are highest priority for v1? | GitHub Actions + GitLab CI + Jenkins/Buildkite. CircleCI and Azure DevOps deferred to v2. | GitHub and GitLab cover the majority of users. Jenkins/Buildkite share an abstraction layer. |
| Is changed-code coverage mandatory for v1 or v2? | v2. Global and per-file coverage in v1; changed-code coverage computation deferred to Phase 2. | Changed-code coverage requires diff-aware analysis that adds scope risk to the initial milestone. |
| How should we model monorepo ownership cleanly? | Use existing CODEOWNERS / team-to-path mappings from the entity resolution contract (Epic 1). Service boundaries inferred from path prefixes when CODEOWNERS is unavailable. | Avoids new ownership primitives; leverages existing team/repo/service entity mapping. |
| Should release confidence be deterministic, heuristic, or ML-assisted in v1? | Deterministic with documented weights. Heuristic or ML-assisted scoring deferred to Phase 3. | Deterministic models are explainable, auditable, and testable. Prerequisite for future ML. |
| Do we expose cost metrics (runner spend, wasted compute) in v1? | No. Deferred to Phase 3 (cost-of-quality analytics). | Requires CI vendor billing API integration not yet scoped. |
Open Questions (Remaining)¶
- How should flake detection handle environment-dependent failures vs true code flakes?
- Should test ownership attribution prefer CODEOWNERS, directory convention, or historical authorship?
- What is the minimum artifact retention policy for raw test results before rollup-only retention?
Phased Delivery¶
Phase 1¶
- CI ingestion
- basic test result ingestion
- pipeline and flake metrics
- dashboard and team views
Phase 2¶
- changed-code coverage
- release confidence
- risk scoring
- alerting and anomaly detection
Phase 3¶
- predictive failure models
- recommended remediation
- cost-of-quality analytics
- PR-level quality guidance
PRD: AI Generative Reports¶
Product Name¶
AI Generative Reports for Dev Health Metrics
Summary¶
AI Generative Reports let users ask natural-language questions and receive generated reports, narrative analysis, live charts, and reusable reporting workflows based on their Dev Health data.
This is not just a chatbot on top of dashboards. It is a report-generation layer that composes metrics, trends, comparisons, explanations, and visuals from the underlying analytics engine.
Example:
Create a weekly report that shows cycle time, review bottlenecks, flaky test growth, and after-hours work for the platform team, with live charts and a summary of what changed.
Problem¶
The product can accumulate excellent metrics and still fail if users must manually assemble dashboards every week. Current analytics products often break at the last mile:
- leaders need answers, not panels
- teams need recurring summaries, not dashboard archaeology
- chart creation is too manual
- natural-language questions are not mapped cleanly to the metric model
- generated insights often hallucinate or ignore actual data boundaries
Goals¶
- Let users generate trustworthy, data-grounded reports from natural language
- Automatically create charts, summaries, comparisons, and anomalies from live metrics
- Support reusable scheduled reports without requiring dashboard authoring
- Preserve strict traceability from narrative claims back to computed metrics
- Reduce time-to-insight for leaders and managers
Non-Goals¶
- Open-ended general chat unrelated to Dev Health data
- Unbounded autonomous analysis across systems without guardrails
- Replacing analysts for bespoke deep-dive work in v1
- Generating unsupported conclusions without metric provenance
Target Users¶
- Engineering executives
- VPs and directors
- EMs
- Dev productivity teams
- Program/release leads
- Team leads
- Individual developers who want personal or team summaries
Core User Stories¶
- As a VP, I want a weekly engineering health summary with charts and narrative so I can review trends in minutes.
- As an EM, I want to ask why cycle time worsened this week and get a grounded explanation with supporting metrics.
- As a platform lead, I want a reusable report for CI instability and flaky tests by team every Monday morning.
- As a team lead, I want a monthly health review covering throughput, durability, collaboration, and burnout indicators.
- As a developer, I want to ask which code areas and PR patterns are creating the most rework in my team.
Product Principles¶
- Grounded first: every narrative claim must map to actual metric outputs
- Structured generation: LLM composes from validated metric payloads, not raw guesswork
- Visual by default: charts and tables are first-class outputs
- Reusable: any ad hoc prompt can become a saved report template
- Explainable: generated output must show what data, time range, and filters were used
Supported Output Types¶
- Executive summary
- Weekly team report
- Monthly business review
- Incident/release health review
- Delivery risk report
- Quality trend report
- Developer well-being summary
- Custom ad hoc analysis
- Slide-ready export
- Markdown export
- JSON/report spec export
Functional Requirements¶
1. Natural Language Query Interface¶
Users can ask prompts such as: - Create a weekly report for the platform team - Why did review time spike in March? - Show me teams with worsening build stability and burnout risk - Generate a monthly report on delivery vs durability for all backend teams
The system must: - parse entities, metrics, time windows, comparisons, and grouping instructions - resolve ambiguous prompt elements to known metric model concepts where possible - reject unsupported requests cleanly instead of inventing data
2. Report Planner¶
Convert user intent into a structured report plan: - audience - time range - scope - metrics requested - comparison periods - narrative sections - required charts - required insights - confidence/provenance requirements
3. Metrics Retrieval Layer¶
- fetch only validated metrics from the core analytics system
- support aggregation, ranking, slicing, filtering, and trend analysis
- support cross-domain joins such as cycle time + flake rate + after-hours work
4. Chart Generator¶
Support auto-generation of: - line charts - bar charts - stacked composition charts - heatmaps - ranking tables - trend deltas - scorecards
Charts must be: - live against current data - configurable by team, repo, service, and time range - embeddable into saved reports
5. Narrative Generation¶
The system generates: - key findings - trend summaries - anomaly descriptions - comparison narratives - risk callouts - recommended next questions
Narrative must be constrained by: - available metrics only - explicit time window - known entity scope - thresholded confidence rules
6. Saved Reports¶
Users can: - save a generated report definition - rerun it on demand - parameterize team, repo, and date ranges - clone and modify templates
7. Scheduled Reports¶
Users can schedule recurring reports: - weekly - monthly - post-release - post-incident - end-of-sprint
Delivery targets can include: - in-app dashboard/report center - markdown export - email or Slack delivery in later phases
8. Recommended Reports¶
System suggests report templates such as: - Weekly Engineering Health - Team Delivery and Quality Review - CI Stability and Test Reliability - Burnout Risk and Flow State Trends - Release Readiness Overview
Example Report Structure¶
Weekly Engineering Health Report¶
- Executive summary
- Delivery trends
- Quality and TestOps trends
- Collaboration and review health
- Well-being signals
- Risks and anomalies
- Recommended actions
UX Requirements¶
- Prompt box with examples
- Preview of parsed report scope before execution
- Editable report outline
- Live chart rendering in the report body
- Ability to pin or remove sections
- Provenance panel showing:
- data sources used
- metrics used
- time range
- filters applied
- One-click save as template
- One-click export to markdown
Trust and Guardrails¶
This feature will fail if it becomes a hallucination engine. Requirements:
- narrative claims must reference computed metrics
- unsupported metrics must be omitted or explicitly marked unsupported
- every report must display time window and filter scope
- every generated insight should have a confidence state:
- direct metric fact
- inferred from correlated signals
- hypothesis needing further validation
- no freeform recommendations without supporting evidence
Report DSL / Internal Spec¶
Every generated report should compile to an internal structured format, for example:
```yaml report_type: weekly_engineering_health scope: teams: ["platform"] time_range: start: 2026-04-01 end: 2026-04-07 sections: - summary - delivery - quality - testops - wellbeing charts: - metric: cycle_time type: line group_by: week - metric: flaky_test_rate type: bar group_by: repo insights: - trend_deltas - anomalies - top_risks