Investment API (Aggregation & Effort Weighting)¶

How persisted investment distributions become the numbers shown in the product. This is the read side of the Investment Categorization Pipeline; the tables it reads are described in the Investment Data Model.

Query code lives in src/dev_health_ops/api/queries/investment.py; the user-facing response builder is in src/dev_health_ops/api/services/investment.py.

The one thing to understand: values are effort-weighted¶

A WorkUnit stores a probability per theme/subcategory. The API does not report those probabilities directly — it multiplies each probability by the WorkUnit's effort_value and sums across WorkUnits:

-- fetch_investment_breakdown (simplified)
SELECT
    subcategory_kv.1 AS subcategory,
    splitByChar('.', subcategory_kv.1)[1] AS theme,
    sum(subcategory_kv.2 * effort_value) AS value
FROM latest_work_unit_investments AS work_unit_investments
ARRAY JOIN CAST(subcategory_distribution_json AS Array(Tuple(String, Float32))) AS subcategory_kv
WHERE from_ts < %(end_ts)s AND to_ts >= %(start_ts)s AND org_id = %(org_id)s
GROUP BY subcategory, theme
ORDER BY value DESC

So "60% Feature Delivery" means 60% of weighted effort, not 60% of tickets and not the average probability. A WorkUnit with a large effort_value moves the distribution far more than a tiny one.

What `effort_value` is¶

Computed at materialization time by _effort_from_work_unit, with this precedence:

Priority	Source	`effort_metric`
1	Commit churn (additions + deletions)	`churn_loc`
2	PR churn (additions + deletions)	`churn_loc`
3	Issue active hours	`active_hours`
4	none → `0.0`	`churn_loc`

A WorkUnit with effort_value == 0 contributes nothing to weighted aggregates even though it still has a valid distribution.

Query surfaces¶

Function	Shape	Weighting
`fetch_investment_breakdown`	subcategory + theme → value	`subcategory_prob * effort_value`
`fetch_investment_edges`	theme → repo/scope → value	`theme_prob * effort_value`
`fetch_investment_subcategory_edges`	subcategory → repo/scope → value	`subcategory_prob * effort_value`
`fetch_investment_team_edges`	subcategory → team → value	`subcategory_prob * effort_value`

All of them ARRAY JOIN the stored Map distribution into rows, filter by the time window (from_ts < end_ts AND to_ts >= start_ts) and org_id, then group.

Themes are derived in SQL from the subcategory prefix (splitByChar('.', subcategory)[1]), mirroring the deterministic roll-up done at compute time — the API never re-categorizes.

`unassigned` means missing scope, not a category¶

Some edge queries surface an unassigned scope label for WorkUnits with no resolved repo or team. The exact expression varies by query, e.g.:

-- fetch_investment_subcategory_edges: missing repo
ifNull(r.repo, if(repo_id IS NULL, 'unassigned', toString(repo_id))) AS target
-- fetch_investment_team_edges: missing team
ifNull(nullIf(unit_team.team, ''), 'unassigned') AS target

(fetch_investment_edges and the sunburst query instead use ifNull(r.repo, toString(repo_id)), which can fall back to the raw repo_id.) In every case unassigned is a scope/grouping label for a WorkUnit with no resolved repo/team — it is not an investment category and never appears in a theme or subcategory distribution. The categorization itself never returns "unknown" (see the pipeline guarantees).

Do not confuse this with the legacy rule-based classifier in analytics/investment.py. That path is explicitly deprecated/isolated to the pre-WorkUnit daily investment_* metric tables; it is not part of the canonical Investment View and now falls back to the legacy product/general bucket rather than an unassigned category.

Read semantics (latest row)¶

work_unit_investments is ReplacingMergeTree(computed_at), so duplicate physical rows for the same work_unit_id can exist until ClickHouse merges them. Investment API reads use an explicit latest-row subquery (argMax(..., computed_at) grouped by work_unit_id) before any effort-weighted sum(...). This gives API totals latest-row-by-computed_at semantics without requiring FINAL; see the data model read-semantics note.

Evidence quality stats¶

api/services/investment.py also surfaces evidence-quality statistics (the 0–1 score and its band) alongside distributions, and guards for missing tables/columns. Use these to convey confidence in the UI — a distribution dominated by low-quality or fallback WorkUnits should be presented with appropriate uncertainty (see the LLM Categorization Contract).