Investment Categorization Pipeline
How a piece of work is assigned an investment categorization, end to end.
This is the compute-time pipeline. It runs during dev-hops investment materialize
(see Investment Materialization) and persists
distributions to ClickHouse (see Investment Data Model).
The API later reads those distributions back, effort-weighted
(see Investment API).
For the strict LLM input/output schema, see the LLM Categorization Contract. For the canonical themes and subcategories, see the Investment Taxonomy.
Mental model. An individual issue, PR, or commit is never tagged with a category. The unit of categorization is a WorkUnit (a cluster of related items), and the output is a probability distribution over a fixed taxonomy — not a single label. Themes are a deterministic roll-up of subcategory probabilities; the LLM never picks a theme directly.
Code lives in src/dev_health_ops/work_graph/investment/.
Pipeline at a glance
flowchart TD
A[work_graph_edges] --> B[Build connected components]
B --> C[Per component: build text evidence bundle]
C --> D{Enough evidence?}
D -- no --> F[Deterministic fallback prior]
D -- yes --> E[LLM -> 15-subcategory distribution]
E --> V{Strict validation}
V -- ok --> G[Normalize subcategory vector]
V -- fail --> R[One repair re-prompt]
R -- ok --> G
R -- fail --> F
F --> G
G --> H[Deterministic roll-up to 5 themes]
H --> I[Compute evidence quality + effort value]
I --> J[(Persist: work_unit_investments + quotes)]
Step 1: Form the WorkUnit
materialize_investments reads typed edges from work_graph_edges via
fetch_work_graph_edges and builds connected components with _build_components.
Each component is one WorkUnit: a set of linked issue / pr / commit nodes.
The WorkUnit id is a stable SHA-256 of its sorted node tokens (work_unit_id).
Categorization therefore happens on the cluster, not on the individual item.
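A minimal sketch of this step, assuming edges arrive as pairs of node tokens such as "issue:1" or "pr:7". The token format, the "|" separator, and the function names build_components and work_unit_id as written here are illustrative assumptions, not the signatures in the repository.

```python
import hashlib
from collections import defaultdict


def build_components(edges):
    """Group nodes into connected components with a small union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def work_unit_id(nodes):
    """Hash the sorted node tokens so the id is independent of traversal order."""
    joined = "|".join(sorted(nodes))  # separator is an assumption
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Sorting before hashing is what makes the id stable: the same set of nodes yields the same id regardless of the order in which edges were read.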
Correction: components are not window-bounded.
fetch_work_graph_edges filters by repo / org only, not by time. Components are built from the full edge set, and the requested time window (--from/--to) is applied afterward: a component is skipped only if its node time-bounds (compute_time_bounds) fall entirely outside the window. Do not describe components as "built from edges inside the window", because a long-lived component can span many periods. (Whether that is the desired product behavior is tracked separately as an engineering question.)
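The skip rule amounts to an interval-overlap test. The sketch below assumes compute_time_bounds yields a (start, end) pair per component; overlaps_window is a hypothetical helper, and treating both window bounds as inclusive is an assumption.

```python
from datetime import datetime


def overlaps_window(start, end, window_from, window_to):
    """True unless the component's [start, end] lies entirely outside the window."""
    return start <= window_to and end >= window_from


# A component that started long before the window is still kept if it overlaps it.
kept = overlaps_window(
    datetime(2023, 1, 1), datetime(2024, 3, 15),
    datetime(2024, 3, 1), datetime(2024, 3, 31),
)
```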
Step 2: Build the text evidence bundle
For each WorkUnit, build_text_bundle (in evidence.py) gathers bounded text:
| Source | Max items | Fields used |
|---|---|---|
| Issues | 6 | title, description, type, labels, parent title, epic title |
| PRs | 6 | title, body |
| Commits | 12 | subject line (first non-empty line) |
Each field is truncated to 280 chars and each source to 900 chars. The result
is a source_block (the text shown to the LLM) plus a per-source map (source_texts)
used later to verify quotes, and an input_hash (SHA-256 of the serialized sources)
persisted for audit.
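The bounding and hashing can be sketched as follows. The helper names, the newline join, and the JSON serialization with sorted keys are assumptions made for illustration; only the 280 / 900 limits, the first-non-empty-line rule for commits, and the use of SHA-256 come from the description above.

```python
import hashlib
import json

MAX_FIELD_CHARS = 280
MAX_SOURCE_CHARS = 900


def bound_source(fields):
    """Truncate each field, join the non-empty ones, then cap the whole source."""
    parts = [f.strip()[:MAX_FIELD_CHARS] for f in fields if f and f.strip()]
    return "\n".join(parts)[:MAX_SOURCE_CHARS]


def first_nonempty_line(message):
    """Commit subject: the first line that is not blank."""
    for line in message.splitlines():
        if line.strip():
            return line.strip()
    return ""


def input_hash(source_texts):
    """SHA-256 over a deterministic serialization (format is an assumption)."""
    serialized = json.dumps(source_texts, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()
```

Hashing a deterministic serialization means two runs over identical evidence produce the same input_hash, which is what makes the value useful for audit.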
Step 3: Gate: LLM vs deterministic fallback
Before any LLM call, materialize_investments decides per component:
- text_char_count < MIN_EVIDENCE_CHARS → fallback (insufficient_evidence)
- text_source_count == 0 → fallback (no_text_sources)
- otherwise → send to the LLM
This keeps cost down and avoids asking the model to categorize empty clusters.
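A sketch of the gate as a pure function. The threshold value below is a placeholder, gate_status is a hypothetical name, and the order of the two checks is an assumption (the source-count check runs first here so that both statuses are reachable).

```python
from typing import Optional

MIN_EVIDENCE_CHARS = 40  # placeholder value; the real constant lives in the pipeline code


def gate_status(text_char_count: int, text_source_count: int) -> Optional[str]:
    """Return a fallback status, or None when the component should go to the LLM."""
    if text_source_count == 0:
        return "no_text_sources"
    if text_char_count < MIN_EVIDENCE_CHARS:
        return "insufficient_evidence"
    return None
```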
Step 4: LLM assigns subcategory probabilities
categorize_text_bundle (in categorize.py) sends the canonical prompt and expects a
probability distribution across all 15 subcategories (summing to 1), plus 1–10
extractive evidence quotes and a short uncertainty string. The 15 subcategories
are the fixed registry in investment_taxonomy.py
(see Investment Taxonomy).
This is the step that makes the determination. Everything downstream is deterministic.
Step 5: Strict validation, one repair, deterministic fallback
validate_llm_payload (in llm_schema.py) enforces:
- top-level keys are exactly subcategories, evidence_quotes, uncertainty;
- every subcategory key is in the canonical set; each probability in [0, 1];
- the distribution sums within [0.9, 1.1]: a clean [0.98, 1.02] sum is accepted as-is, a near-miss is renormalized and flagged probability_sum_renormalized in the audit, and ≤ 0 or outside [0.9, 1.1] is rejected;
- each evidence quote is a literal substring of the provided source text (anti-hallucination), 1–10 quotes, source ∈ {issue, pr, commit};
- uncertainty is non-empty and ≤ 280 chars.
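Two of these rules, the sum bands and the literal-substring check, are sketched below. This is not validate_llm_payload itself: the function names, the "source" / "quote" field names, and the shape of source_texts (source type mapped to its text) are assumptions.

```python
def check_probability_sum(probs):
    """Apply the sum bands: accept, renormalize and flag, or reject."""
    total = sum(probs.values())
    if total <= 0 or not (0.9 <= total <= 1.1):
        raise ValueError(f"probability sum {total:.4f} outside [0.9, 1.1]")
    if 0.98 <= total <= 1.02:
        return dict(probs), []  # clean sum, accepted as-is
    normalized = {key: value / total for key, value in probs.items()}
    return normalized, ["probability_sum_renormalized"]


def check_quotes(quotes, source_texts):
    """Return a list of error strings; an empty list means the quotes pass."""
    errors = []
    if not 1 <= len(quotes) <= 10:
        errors.append("expected 1-10 evidence quotes")
    for item in quotes:
        source, quote = item.get("source"), item.get("quote", "")
        if source not in {"issue", "pr", "commit"}:
            errors.append(f"unknown source: {source!r}")
        elif not quote or quote not in source_texts.get(source, ""):
            errors.append(f"quote not found in {source} text: {quote!r}")
    return errors
```

Requiring each quote to be a literal substring is a cheap check that the model cited text it was actually shown.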
On failure, exactly one repair re-prompt is attempted (the validation errors are
fed back). If it still fails, a deterministic fallback distribution is applied with
status invalid_llm_output.
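The control flow can be sketched as below, assuming two hypothetical callables: call_llm(errors), which takes the previous validation errors (or None on the first attempt), and validate(payload), which returns a list of errors. Neither is the real interface of categorize.py or llm_schema.py.

```python
def categorize_with_repair(call_llm, validate, fallback):
    """Try once, re-prompt once with the validation errors, then fall back."""
    payload = call_llm(None)
    errors = validate(payload)
    if not errors:
        return payload, "ok"

    repaired = call_llm(errors)  # the single repair attempt
    if not validate(repaired):
        return repaired, "repaired"

    return fallback, "invalid_llm_output"
```

Capping repairs at one keeps cost and latency bounded per WorkUnit, and the returned status records which path was taken.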
The fallback is a neutral prior, not "unknown."
FALLBACK_PRIOR spreads weight evenly across one representative subcategory per theme (and zeroes the rest). It satisfies the "never unknown" contract, but semantically it means "insufficient validated evidence to assign a confident mix," not a meaningful estimate. Treat low evidence_quality and a fallback categorization_status as low-confidence signals in any UX.
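To illustrate the shape of such a prior, here is a sketch over a toy two-theme taxonomy. The keys are placeholders, not the real taxonomy, and which subcategory represents each theme in FALLBACK_PRIOR is not stated here; build_neutral_prior is a hypothetical helper.

```python
def build_neutral_prior(subcategories, representatives):
    """Spread weight evenly over one representative per theme; zero everything else."""
    themes = [key.split(".", 1)[0] for key in representatives]
    if len(set(themes)) != len(themes):
        raise ValueError("expected exactly one representative per theme")
    weight = 1.0 / len(representatives)
    return {key: (weight if key in representatives else 0.0) for key in subcategories}


# Placeholder keys for illustration only.
TOY_SUBCATEGORIES = ["theme_a.one", "theme_a.two", "theme_b.one", "theme_b.two"]
prior = build_neutral_prior(TOY_SUBCATEGORIES, ["theme_a.one", "theme_b.one"])
```

With five themes the same construction gives each representative a weight of 0.2, so the rolled-up theme distribution is uniform.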
Possible categorization_status values: ok, repaired, invalid_llm_output,
insufficient_evidence, no_text_sources, and llm_task_failed (the async LLM task
raised before an outcome was recorded).
Step 6: Deterministic theme roll-up
Themes are never chosen by the LLM. rollup_subcategories_to_themes sums
subcategory probabilities by their prefix (operational.on_call → operational) and
normalizes across the 5 themes. The subcategory vector is first filled out to all 15
keys and normalized via ensure_full_subcategory_vector. Pure arithmetic — no model
involved. This is what prevents category drift.
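The roll-up is a group-by on the key prefix followed by normalization. A minimal sketch, assuming subcategory keys have the form theme.subcategory; rollup_to_themes is a hypothetical stand-in for rollup_subcategories_to_themes, and unlike the real function it may not emit themes that have no subcategory present.

```python
from collections import defaultdict


def rollup_to_themes(subcategory_probs):
    """Sum probabilities by theme prefix, then normalize across themes."""
    totals = defaultdict(float)
    for key, prob in subcategory_probs.items():
        theme = key.split(".", 1)[0]  # "operational.on_call" -> "operational"
        totals[theme] += prob

    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("cannot normalize an all-zero distribution")
    return {theme: value / grand_total for theme, value in totals.items()}
```

Because the theme is derived from the key rather than chosen by the model, a given subcategory vector always maps to the same theme vector.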
Step 7: Evidence quality and effort value
compute_evidence_quality (in evidence.py) emits a 0–1 score:
0.4 * text_score + 0.3 * source_agreement + 0.3 * structural_density
where text_score reflects how much text was available, source_agreement rewards
having more than one source type (issue/pr/commit), and structural_density combines
graph density with average edge confidence. It is banded into
high / moderate / low / very_low.
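The weighted sum and banding can be sketched as below. The weights come from the formula above; the band cut-offs are hypothetical values chosen only to make the example concrete, and clamping the result to [0, 1] is an assumption.

```python
def evidence_quality(text_score, source_agreement, structural_density):
    """Weighted blend of the three component scores, clamped to [0, 1]."""
    score = 0.4 * text_score + 0.3 * source_agreement + 0.3 * structural_density
    return max(0.0, min(1.0, score))


# Hypothetical cut-offs; the real thresholds are defined in evidence.py.
BANDS = [(0.75, "high"), (0.5, "moderate"), (0.25, "low")]


def band(score):
    """Map a 0-1 score to its label, checking the highest band first."""
    for threshold, label in BANDS:
        if score >= threshold:
            return label
    return "very_low"
```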
Separately, _effort_from_work_unit computes the effort value used later for
weighting (see Investment API). Precedence:
- commit churn (additions + deletions) → metric churn_loc
- else PR churn → metric churn_loc
- else issue active hours → metric active_hours
- else 0.0
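The precedence above can be written as a short chain of checks. This is a sketch, not _effort_from_work_unit: the argument names are hypothetical, "available" is assumed to mean a positive value, and the metric label for the final 0.0 case is not stated in this document, so None is used as a placeholder.

```python
def effort_from_work_unit(commit_churn, pr_churn, issue_active_hours):
    """Return (metric, value) using the first available signal."""
    if commit_churn > 0:
        return "churn_loc", float(commit_churn)
    if pr_churn > 0:
        return "churn_loc", float(pr_churn)
    if issue_active_hours > 0:
        return "active_hours", float(issue_active_hours)
    return None, 0.0  # metric label for this case is an assumption
```

Note that the two metrics have different units (lines of code versus hours), so the metric name needs to travel with the value.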
Step 8: Persist
A WorkUnitInvestmentRecord is written per WorkUnit to work_unit_investments with the
theme distribution, subcategory distribution, structural evidence, evidence quality,
effort metric/value, and audit fields (categorization_status,
categorization_model_version, categorization_input_hash, categorization_run_id,
computed_at).
Evidence quotes are written to work_unit_investment_quotes by default for CLI and
worker materialization runs. They can be skipped with
--no-persist-evidence-snippets for storage-constrained backfills. See
Investment Data Model for table schemas and read semantics.
Guarantees
For every WorkUnit:
- theme probabilities sum to ~1.0;
- the subcategory probabilities under each theme sum to that theme's probability;
- evidence arrays exist (may be empty);
- evidence quality is always emitted;
- categorization never returns "unknown".
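The two numeric guarantees lend themselves to a simple consistency check, for example in a test or a data-quality job. The sketch below is an illustration only; check_guarantees and its tolerance are assumptions, not part of the pipeline.

```python
import math


def check_guarantees(theme_probs, subcategory_probs, tolerance=1e-6):
    """Return a list of violated guarantees; an empty list means the record is consistent."""
    problems = []
    if not math.isclose(sum(theme_probs.values()), 1.0, abs_tol=tolerance):
        problems.append("theme probabilities do not sum to 1")
    for theme, theme_prob in theme_probs.items():
        under_theme = sum(
            prob
            for key, prob in subcategory_probs.items()
            if key.split(".", 1)[0] == theme
        )
        if not math.isclose(under_theme, theme_prob, abs_tol=tolerance):
            problems.append(f"subcategories under {theme!r} do not sum to its probability")
    return problems
```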
What this pipeline does not do
- It does not tag individual issues/PRs/commits.
- It does not let the LLM choose themes or invent categories.
- It does not recompute anything at UX-time — explanations read persisted data only (see the LLM Categorization Contract).
- It does not apply effort weighting here; weighting happens at read time in the API.