Investment API (Aggregation & Effort Weighting)¶
How persisted investment distributions become the numbers shown in the product. This is the read side of the Investment Categorization Pipeline; the tables it reads are described in the Investment Data Model.
Query code lives in src/dev_health_ops/api/queries/investment.py; the user-facing
response builder is in src/dev_health_ops/api/services/investment.py.
The one thing to understand: values are effort-weighted¶
A WorkUnit stores a probability per theme/subcategory. The API does not report
those probabilities directly — it multiplies each probability by the WorkUnit's
effort_value and sums across WorkUnits:
-- fetch_investment_breakdown (simplified)
SELECT
subcategory_kv.1 AS subcategory,
splitByChar('.', subcategory_kv.1)[1] AS theme,
sum(subcategory_kv.2 * effort_value) AS value
FROM latest_work_unit_investments AS work_unit_investments
ARRAY JOIN CAST(subcategory_distribution_json AS Array(Tuple(String, Float32))) AS subcategory_kv
WHERE from_ts < %(end_ts)s AND to_ts >= %(start_ts)s AND org_id = %(org_id)s
GROUP BY subcategory, theme
ORDER BY value DESC
So "60% Feature Delivery" means 60% of weighted effort, not 60% of tickets and not
the average probability. A WorkUnit with a large effort_value moves the distribution
far more than a tiny one.
What effort_value is¶
Computed at materialization time by _effort_from_work_unit, with this precedence:
| Priority | Source | effort_metric |
|---|---|---|
| 1 | Commit churn (additions + deletions) | churn_loc |
| 2 | PR churn (additions + deletions) | churn_loc |
| 3 | Issue active hours | active_hours |
| 4 | none → 0.0 |
churn_loc |
A WorkUnit with effort_value == 0 contributes nothing to weighted aggregates even
though it still has a valid distribution.
Query surfaces¶
| Function | Shape | Weighting |
|---|---|---|
fetch_investment_breakdown |
subcategory + theme → value | subcategory_prob * effort_value |
fetch_investment_edges |
theme → repo/scope → value | theme_prob * effort_value |
fetch_investment_subcategory_edges |
subcategory → repo/scope → value | subcategory_prob * effort_value |
fetch_investment_team_edges |
subcategory → team → value | subcategory_prob * effort_value |
All of them ARRAY JOIN the stored Map distribution into rows, filter by the time
window (from_ts < end_ts AND to_ts >= start_ts) and org_id, then group.
Themes are derived in SQL from the subcategory prefix
(splitByChar('.', subcategory)[1]), mirroring the deterministic roll-up done at
compute time — the API never re-categorizes.
unassigned means missing scope, not a category¶
Some edge queries surface an unassigned scope label for WorkUnits with no resolved
repo or team. The exact expression varies by query, e.g.:
-- fetch_investment_subcategory_edges: missing repo
ifNull(r.repo, if(repo_id IS NULL, 'unassigned', toString(repo_id))) AS target
-- fetch_investment_team_edges: missing team
ifNull(nullIf(unit_team.team, ''), 'unassigned') AS target
(fetch_investment_edges and the sunburst query instead use ifNull(r.repo, toString(repo_id)),
which can fall back to the raw repo_id.) In every case unassigned is a scope/grouping
label for a WorkUnit with no resolved repo/team — it is not an investment category and
never appears in a theme or subcategory distribution. The categorization itself never returns
"unknown" (see the pipeline guarantees).
Do not confuse this with the legacy rule-based classifier in
analytics/investment.py. That path is explicitly deprecated/isolated to the pre-WorkUnit dailyinvestment_*metric tables; it is not part of the canonical Investment View and now falls back to the legacyproduct/generalbucket rather than anunassignedcategory.
Read semantics (latest row)¶
work_unit_investments is ReplacingMergeTree(computed_at), so duplicate physical rows
for the same work_unit_id can exist until ClickHouse merges them. Investment API reads
use an explicit latest-row subquery (argMax(..., computed_at) grouped by
work_unit_id) before any effort-weighted sum(...). This gives API totals
latest-row-by-computed_at semantics without requiring FINAL; see the
data model read-semantics note.
Evidence quality stats¶
api/services/investment.py also surfaces evidence-quality statistics (the 0–1 score
and its band) alongside distributions, and guards for missing tables/columns. Use these
to convey confidence in the UI — a distribution dominated by low-quality or fallback
WorkUnits should be presented with appropriate uncertainty (see the
LLM Categorization Contract).